We encounter statistics in our 
daily lives more often than we probably realize and 
from many different sources, like the news. (credit: 
David Sim) 


Chapter objective 
By the end of this chapter, the student should be 
able to: 


* Recognize and differentiate between key 
terms. 

* Apply various types of sampling methods to 
data collection. 


You are probably asking yourself the question, 
"When and where will I use statistics?" If you read 
any newspaper, watch television, or use the 
Internet, you will see statistical information. There 
are statistics about crime, sports, education, politics, 
and real estate. Typically, when you read a 
newspaper article or watch a television news 
program, you are given sample information. With 
this information, you may make a decision about 
the correctness of a statement, claim, or "fact." 
Statistical methods can help you make the "best 
educated guess." 


Since you will undoubtedly be given statistical 
information at some point in your life, you need to 
know some techniques for analyzing the information 
thoughtfully. Think about buying a house or 
managing a budget. Think about your chosen 
profession. The fields of economics, business, 
psychology, education, biology, law, computer 
science, police science, and early childhood 
development require at least one course in statistics. 


Included in this chapter are the basic ideas and 
words of probability and statistics. You will soon 
understand that statistics and probability work 
together. You will also learn how data are gathered 
and how "good" data can be distinguished from 
"bad." 


Data, Sampling, and Variation -- MRU - C Lemieux 
(2017) 


Data may come from a population or from a sample. 
Small letters like x or y generally are used to 
represent data values. Most data can be put into the 
following categories: 


* Categorical 
* Quantitative 


Categorical data (also called qualitative data) are 
the result of categorizing or describing attributes of 
a population. Hair colour, blood type, ethnic group, 
the car a person drives, and the street a person lives 
on are examples of categorical data. Categorical 
data are generally described by words or letters. For 
instance, hair colour might be black, dark brown, 
light brown, blonde, grey, or red. Blood type might 
be AB+, O-, or B+. Researchers often prefer to use 
quantitative data over categorical data because they 
lend themselves more easily to mathematical analysis. 
For example, it does not make sense to find an 
average hair colour or blood type. 


There are two types of categorical data: nominal 
and ordinal. Nominal data are categorical data that 
cannot be ordered in a meaningful way. For 
example, the colour of a car is categorical, but the 
order of the colours is not meaningful. Ordinal 
data are categorical data that can be ordered in a 
meaningful way. For example, the level of 
satisfaction someone has with their experience at a 
restaurant, ranging from not at all satisfied to 
completely satisfied, is ordinal. 


Quantitative data are always numbers. 
Quantitative data are the result of counting or 
measuring attributes of a population. Amount of 
money, pulse rate, weight, number of people living 
in your town, and number of students who take 
statistics are examples of quantitative data. 
Quantitative data may be either discrete or 
continuous. 


All data that are the result of counting are called 
quantitative discrete data. These data take on only 
certain numerical values. If you count the number of 
phone calls you receive for each day of the week, 
you might get values such as zero, one, two, or 
three. 


All data that are the result of measuring are 
quantitative continuous data, assuming that we 
can measure accurately. Time, distance, area, and 
anything else that can be subdivided and then 
subdivided again and again is a continuous 
variable. If you and your friends carry backpacks 
with books in them to school, the numbers of books 
in the backpacks are discrete data and the weights 
of the backpacks are continuous data. 


Data Sample of Quantitative Discrete Data 

The data are the number of books students carry in 
their backpacks. You sample five students. Two 
students carry three books, one student carries four 
books, one student carries two books, and one 
student carries one book. The numbers of books 
(three, four, two, and one) are the quantitative 
discrete data. 


Try It 


The data are the number of machines in a 
gym. You sample five gyms. One gym has 12 
machines, one gym has 15 machines, one gym 
has ten machines, one gym has 22 machines, 
and the other gym has 20 machines. What type 
of data is this? 


Try It Solutions 


quantitative discrete data 


Data Sample of Quantitative Continuous Data 
The data are the weights of backpacks with books 
in them. You sample the same five students. The 
weights (in pounds) of their backpacks are 6.2, 7, 
6.8, 9.1, 4.3. Notice that backpacks carrying three 
books can have different weights. Weights are 
quantitative continuous data because weights are 
measured. 


Try It 


The data are the areas of lawns in square feet. 
You sample five houses. The areas of the lawns 
are 144 sq. feet, 160 sq. feet, 190 sq. feet, 180 
sq. feet, and 210 sq. feet. What type of data is 
this? 


Try It Solutions 


quantitative continuous data 


You go to the supermarket and purchase three cans 
of soup (19 ounces tomato bisque, 14.1 ounces 
lentil, and 19 ounces Italian wedding), two 
packages of nuts (walnuts and peanuts), four 
different kinds of vegetable (broccoli, cauliflower, 
spinach, and carrots), and two desserts (16 ounces 
Cherry Garcia ice cream and two pounds (32 
ounces) chocolate chip cookies). 


Name data sets that are quantitative discrete, 
quantitative continuous, categorical ordinal, 
and categorical nominal. 


One Possible Solution: 


* The three cans of soup, two packages of 
nuts, four kinds of vegetables and two 
desserts are quantitative discrete data 
because you count them. 

* The weights of the soups (19 ounces, 14.1 
ounces, 19 ounces) are quantitative 
continuous data because you measure 
weights as precisely as possible. 

* Types of soups, nuts, vegetables and 
desserts are categorical nominal data 
because they are categories and 
fundamentally words. Further, there is no 
meaningful order. 

* Descriptions of amount of rain (e.g. light, 
heavy) are categorical ordinal data as 
they are categories but have a meaningful 
order. 


Try to identify additional data sets in this example. 


The data are the colors of backpacks. Again, you 
sample the same five students. One student has a 


red backpack, two students have black backpacks, 
one student has a green backpack, and one student 
has a gray backpack. The colors red, black, black, 
green, and gray are categorical nominal data. 


Try It 


The data are the colors of houses. You sample 
five houses. The colors of the houses are white, 
yellow, white, red, and white. What type of 
data is this? 


Try It Solutions 


categorical nominal data 


Note 


You may collect data as numbers and report it 
categorically. For example, the quiz scores for each 
student are recorded throughout the term. At the 
end of the term, the quiz scores are reported as A, 
B, C, D, or F. The data is ordinal as there is a 
meaningful order. 
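
The note above can be sketched in a few lines of Python. This is only an illustration: the grade cutoffs (90, 80, 70, 60) and the quiz scores are assumptions made up for the example, not values from this chapter.

```python
# Ordered categories, from lowest to highest. The order is what makes the data ordinal.
GRADE_ORDER = ["F", "D", "C", "B", "A"]

def letter_grade(score):
    """Convert a numeric quiz score to a letter grade (cutoffs are assumed)."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"

# Hypothetical quiz scores collected as numbers (quantitative data).
scores = [85, 42, 91, 67, 73]

# The same data reported categorically (ordinal data).
grades = [letter_grade(s) for s in scores]
print(grades)

# Because the categories have a meaningful order, they can be sorted.
print(sorted(grades, key=GRADE_ORDER.index))
```

Sorting by position in GRADE_ORDER works only because the categories are ordinal; nominal categories such as hair colour have no such ordering to sort by.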


Try It 


Determine the correct data type (quantitative 
or categorical) for the number of cars in a 
parking lot. Indicate whether quantitative data 


are continuous or discrete. 
Try It Solutions 


quantitative discrete 


A statistics professor collects information 
about the classification of her students as 
freshmen, sophomores, juniors, or seniors. The 
data she collects are summarized in the pie 
chart. What type of data does this graph 
show? 


[Pie chart: Classification of Statistics Students, with slices for Freshman, Sophomore, Junior, and Senior] 


This pie chart shows the students in each year, 
which is categorical nominal data. 


Try It 


The registrar at State University keeps records 
of the number of credit hours students 
complete each semester. The data he collects 
are summarized in the histogram. The class 
boundaries are 10 to less than 13, 13 to less 
than 16, 16 to less than 19, 19 to less than 22, 
and 22 to less than 25. 


[Histogram: Number of Credit Hours Completed per Student. Horizontal axis: credit hours completed, with class boundaries at 10, 13, 16, 19, 22, and 25. Vertical axis: number of students.] 


What type of data does this graph show? 
Try It Solutions 


A histogram is used to display quantitative 
data: the numbers of credit hours completed. 


Because students can complete only a whole 
number of hours (no fractions of hours 
allowed), this data is quantitative discrete. 


Sampling 


Gathering information about an entire population 
often costs too much or is virtually impossible. 


Instead, we use a sample of the population. The goal 
would be to use information from the sample to 
estimate information about the population. 


To collect the sample, a sampling technique is used. 
Not all sampling techniques are created equal, 
though. A good sampling technique meets the 
following criteria: 


* The sample is collected randomly 
* The sample is representative of the population 
* The size of the sample is large enough 


If a sampling technique does not meet these criteria, 
then it is not appropriate to make inferences from 
the data. For example, it would not be appropriate 
to estimate the population mean from the sample 
mean. 


A random sample reduces bias, promotes 
representativeness, and is a key component of 
sampling. To do any scientific statistical analysis 
on sample data, the sample has to be randomly 
selected. In a random sample, members of the 
population are selected in such a way that each has 
an equal chance of being selected. To ensure that a 
sample is collected randomly, some element of 
randomness needs to be included in the sampling 
technique. This can involve using dice to choose the 
time to start collecting data or using a random 
number generator to pick names from a list of 
names. 


Humans in general are not very random. Therefore, 
the randomness added to the sampling technique 
cannot be someone “randomly” choosing 


something. The randomness has to come from a 
random event (like rolling dice, flipping a coin, 
using a random number generator). 


A sample is representative if it shares similar 
characteristics to the population. For example, 
suppose that the students at a university are 
distributed as follows by faculty: 


* Business: 20% 

* Arts: 25% 

* Science and Engineering: 30% 
* Nursing: 15% 

* Education: 10% 


Then a sample would be representative of this 
population if the distribution of the students’ faculty 
in the sample was similar to the population. It 
doesn’t have to be exactly the same, but it should be 
close. A random sample will generate a fairly 
representative sample, but it doesn’t guarantee it. 
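
One rough way to check representativeness is to compare the share of each faculty in the sample with its share in the population. The sketch below does this for the faculty percentages listed above. The sample of 20 students is made up for illustration, and the function names are hypothetical.

```python
from collections import Counter

# Population shares by faculty, taken from the list above.
population_share = {
    "Business": 0.20,
    "Arts": 0.25,
    "Science and Engineering": 0.30,
    "Nursing": 0.15,
    "Education": 0.10,
}

def faculty_shares(sample):
    """Return the proportion of the sample that falls in each faculty."""
    counts = Counter(sample)
    n = len(sample)
    return {faculty: counts.get(faculty, 0) / n for faculty in population_share}

def largest_gap(sample):
    """Largest absolute difference between a sample share and the population share."""
    shares = faculty_shares(sample)
    return max(abs(shares[f] - population_share[f]) for f in population_share)

# Hypothetical sample of 20 students (invented for this example).
sample = (["Business"] * 4 + ["Arts"] * 6 + ["Science and Engineering"] * 5
          + ["Nursing"] * 3 + ["Education"] * 2)

print(faculty_shares(sample))
print(round(largest_gap(sample), 2))  # 0.05
```

Here no faculty is off by more than five percentage points, so this sample would probably be judged close to the population. How close is "close enough" is a judgment call that depends on what is being studied.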


What makes a sample representative depends on 
what is being studied. For example, if we are 
looking at the average age of students at a 
university, making sure we get students from each 
faculty would be important, but making sure we 
get students from various political affiliations 
might not be. 


Determining if a sample is large enough is a bit 
arbitrary and depends on the situation. In general, 
the larger the sample size the better, but issues such 
as time and money need to be taken into account. 
You don’t want to interview 5000 people, when 50 
people would do. In Chapter 7, we will look at a 
formula that determines how many members of a 
population need to be in a sample depending on the 
level of error we are comfortable with. Until then, 
as a general rule, if the data are quantitative, a 
sample of at least 30 is usually good enough, while 
if the data are categorical, a sample of at least 100 
is usually good enough. 


In general, even if a sample is collected extremely 
well, it will not be perfectly representative of the 
population. The discrepancy between the sample 
and the population is called chance error due to 
sampling. When dealing with samples, there will 
always be error. Statistics helps us to understand 
and even measure this error. As a rule, the larger 
the random sample, the smaller the 
sampling error. 
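
A small simulation can make this rule concrete. The sketch below is only an illustration: the population is 10,000 made-up values between 0 and 100, and the function name is hypothetical. It draws many random samples of each size and reports how far, on average, the sample mean lands from the population mean.

```python
import random
import statistics

def mean_abs_error(population, n, trials, rng):
    """Average absolute gap between the sample mean and the population mean."""
    mu = statistics.fmean(population)
    gaps = []
    for _ in range(trials):
        sample = rng.sample(population, n)  # random sample without replacement
        gaps.append(abs(statistics.fmean(sample) - mu))
    return statistics.fmean(gaps)

rng = random.Random(1)  # fixed seed so the run can be repeated
# Made-up population of 10,000 values (an assumption for this example).
population = [rng.uniform(0, 100) for _ in range(10_000)]

for n in (10, 100, 1000):
    print(n, round(mean_abs_error(population, n, 200, rng), 2))
```

The printed error should shrink as n grows, but it never reaches zero: some chance error due to sampling always remains.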


Generally, a sample that is collected randomly will 
likely be representative. But this is not guaranteed. 
For example, it is possible to collect a random 
sample of university students that happens to only 
contain students from one faculty. It is unlikely but 
possible. 

A large sample size does not guarantee a 
representative sample. Nor does a small sample 
size guarantee a non-representative sample. To 
illustrate, a sample of ten university students could 
be chosen so that the proportion of students from each 
gender in the sample is similar to the population, 
and the proportion of students from each faculty in 
the sample is similar to the population. Thus, the 
sample of size 10 would be representative. The 
point of a larger sample size is that the larger the 
sample, the more likely it is to be representative. 
Of the three characteristics of a good sample, the 
most important one for statistical analysis is that 
the sample is collected randomly. 


Areas of concern for sampling bias 
When people publish their research, they include a 
description of their sampling technique. This is 


called the methodology. When evaluating a 
sampling technique, check to see if the sample was 
collected randomly, if it is representative of the 
population, and if the sample is large enough. Here 
are some examples of areas of concern when looking 
at methodologies: 


1. Undercoverage occurs when a particular subset 
of the population is excluded from the process 
of selecting the sample. For example, if no one 
from the faculty of nursing is included in the 
sample, then we would say that the faculty of 
nursing is undercovered. As another example, 
undercoverage has been a specific concern in 
drug research over the years. In particular, 
women have been traditionally excluded from 
drug studies because of their menstrual cycles, 
but this results in the research only indicating 
how well the drug works for men. 

2. Nonresponse bias occurs when a member of the 
population that is selected as part of the sample 
cannot be contacted or refuses to participate. 
Have you ever refused to be part of a telephone 
study? If so, you are contributing to 
nonresponse bias. 


* Similar to nonresponse bias is voluntary 
response bias. Here a large segment of the 
population is contacted and people choose 
to participate or not. Examples of this are 
mail-out surveys or online polls. In these 
situations, usually the person is very 
invested in the issue so that is why they 
take the time to answer. This results in 
non-representative samples. Another form 
of voluntary response bias is online 
surveys. Here, only people familiar with 
the website are likely to participate or 
"volunteer" to be part of the survey. 

* Response rate is a measure of how many 
people responded out of the total 
contacted. If the response rate is low, then 
this suggests a very narrow segment of the 
population answered. This would raise 
concerns about representativeness. 


3. Asking potentially awkward questions might 
result in untruthful responses. This is called 
response bias. For example, if you are asked if 
you have ever had a sexually transmitted 
infection, you may not want to divulge that. 
One way to minimize response bias is to allow 
participants in a study to answer the questions 
anonymously. 

4. Improper wording of questions being asked 
might result in skewed answers. Here is an 
example of a question that skews the results: 


* Do you think it should be easier for seniors 
to make ends meet? 


* Yes - they’ve worked hard and helped 
build our country 
* No - seniors don’t need any help or 
recognition 


The wording of this question makes it hard 
to say "no", thus skewing the results 
towards "yes". 


A famous example of a survey that had a very poor 
methodology was the incorrect prediction by the 
Literary Digest that Landon would beat Roosevelt in 
the 1936 US election. Check out the following 
website for more information: 
https://www.math.upenn.edu/~deturck/m170/wk4/lecture/case2.html 


Sampling techniques 


Most statisticians and researchers use various 
methods of random sampling in an attempt to 
achieve a good sample. This section will describe a 
few of the most common techniques: simple random 
sampling, (proportional) stratified random 
sampling, cluster sampling, systematic random 
sampling, and convenience sampling. 


Simple random sampling 

The easiest method to describe is called a simple 
random sample. In this technique, a random 
sample is taken from the members of the 
population. This can be done by putting the names 
(or identifiers) of all members of the population into 
a hat and pulling out those names (or identifiers) to 
choose the sample. Or the population can be 
numbered and a random number generator can 
choose the sample. Here, each member of the 
population has an equal chance of being chosen. If 
the goal of the technique is to get a very random 
sample, this is the best method to use. But it 
requires having a list of the whole population, 
which is not always realistic. 


For example, suppose you want to take a random 
sample of university students. Each student is 
already numbered by their student ID. You could 
randomly select the members of your sample by 
using a random number generator to randomly 
select student ID numbers. 
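
The student ID example can be sketched with Python's built-in random module. The ID range (1 to 14,000), the sample size of 100, and the function name are assumptions chosen for illustration.

```python
import random

def simple_random_sample(ids, n, seed=None):
    """Draw n distinct IDs so that every ID has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(ids, n)  # sampling without replacement

# Hypothetical student IDs numbered 1 to 14,000 (the ID scheme is an assumption).
student_ids = list(range(1, 14_001))

sample = simple_random_sample(student_ids, 100, seed=42)
print(len(sample))   # 100
print(sample[:5])    # first five selected IDs
```

Note that this only works because the full list of student IDs is available, which is the main practical limitation of simple random sampling.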


Stratified sampling and proportionate stratified 
sampling 

If there are concerns that a random sample might 
not fully represent a population (e.g. one portion of 
the population is small compared to another), the 
best sampling technique to use is stratified random 
sampling. In this case, divide the population into 
groups called strata and then take a random sample 
from each stratum. The strata are chosen to be 
portions of the population that need to be 
represented in the sample. The strata need to be 
mutually exclusive. That 
means that each member of the population can only 
belong to one stratum. 


For example, you could stratify (group) your 
university population by faculty and then choose a 
simple random sample from each stratum (each 
faculty) to get a stratified random sample. As a 
student should only belong to one faculty, the 
groups are mutually exclusive. Further, this method 
ensures our sample is representative of the 
population by choosing students from each faculty 
at the university. Using the students per faculty 
example above, if the sample size is 100, to get a 
stratified sample, you would randomly select 20 
students from each faculty (as there are 5 faculties 
and 100 students, choose an equal number from 
each faculty). 


If the size of the sample is proportionate to the size 
of the strata, this is called proportionate stratified 
random sampling. If you wanted a proportionate 
stratified random sample for students by faculty, 
you would randomly select 20 students from 
business, 25 students from arts, 30 from science and 
engineering, 15 from nursing, and 10 from 
education (i.e. proportional to the number of 
students in each faculty). This technique is best used 
when there are large differences in the proportion of 
each group. For example, if the faculty of business 
had 50% of the students and the faculty of nursing 
only had 1% of the students, it would not be good to 


have an equal number of students from each faculty. 
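
The two allocations described above can be sketched as follows. The faculty shares come from the earlier example. The roster of 1,000 numbered students and the function names are made up for illustration.

```python
import random

# Faculty shares from the earlier example.
shares = {
    "Business": 0.20,
    "Arts": 0.25,
    "Science and Engineering": 0.30,
    "Nursing": 0.15,
    "Education": 0.10,
}

def equal_allocation(strata, n):
    """Same number of students from every stratum."""
    per_stratum = n // len(strata)
    return {s: per_stratum for s in strata}

def proportionate_allocation(shares, n):
    """Number from each stratum proportional to its share of the population.
    Rounding may leave the total slightly off n for other shares or sizes."""
    return {s: round(share * n) for s, share in shares.items()}

def stratified_sample(roster, allocation, seed=None):
    """Take a simple random sample of the allocated size within each stratum."""
    rng = random.Random(seed)
    return {s: rng.sample(roster[s], k) for s, k in allocation.items()}

# Hypothetical roster: 1,000 students numbered consecutively by faculty.
roster, next_id = {}, 1
for faculty, share in shares.items():
    size = round(share * 1000)
    roster[faculty] = list(range(next_id, next_id + size))
    next_id += size

print(equal_allocation(shares, 100))          # 20 from each faculty
print(proportionate_allocation(shares, 100))  # 20, 25, 30, 15, 10
chosen = stratified_sample(roster, proportionate_allocation(shares, 100), seed=3)
print({faculty: len(ids) for faculty, ids in chosen.items()})
```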


To randomly choose students from each faculty, a 
random sampling technique needs to be used. This 


could be simple random sampling or systematic 
random sampling (see below). 


Cluster sampling 

To choose a cluster sample, divide the population 
into clusters (groups) and then randomly select one 
of the clusters. That cluster is your sample. Further, 
the clusters need to be homogeneous and each 
cluster needs to be representative of the population. 
For example, suppose the university has a series of 
foundational classes that every student has to take 
and that students in these classes come from all 
faculties. Then we would randomly select one of 
these classes to be our sample. Again, to randomly 
select the class, you have to use a 
random sampling technique. Here, you could 
number all of the classes and then use a random 
number generator to choose one of them. 


If one cluster is too small for the sample, you can 
choose more than one cluster. For example, if you 
want your sample to be 120 students but each of the 
foundational classes only has 30 students in it, 
you can randomly select 4 classes to get to your 
desired sample size. 
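
A minimal sketch of this idea, assuming 20 foundational classes of 30 students each (the number of classes, the class names, and the student labels are invented for the example):

```python
import math
import random

def cluster_sample(clusters, target_size, seed=None):
    """Randomly pick whole clusters until the target sample size is reached.
    Assumes every cluster has the same number of members."""
    rng = random.Random(seed)
    cluster_size = len(next(iter(clusters.values())))
    needed = math.ceil(target_size / cluster_size)  # 120 / 30 = 4 classes
    picked = rng.sample(sorted(clusters), needed)   # random choice of clusters
    members = [m for name in picked for m in clusters[name]]
    return picked, members

# Hypothetical foundational classes, each with 30 students.
clusters = {
    f"class_{i:02d}": [f"s{i:02d}_{j:02d}" for j in range(30)]
    for i in range(1, 21)
}

picked, members = cluster_sample(clusters, 120, seed=7)
print(picked)        # the 4 randomly chosen classes
print(len(members))  # 120
```

The randomness here is in which classes are chosen. Once a class is picked, every student in it is part of the sample.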


Cluster sampling can be very convenient as the 
members of the sample are in one location. In the 
above example, the whole sample is in one class so you 
would just go to the one class and collect your 
sample. Notice that for stratified sampling, we 
would have to find each student chosen from each 
faculty. Thus, cluster sampling can save time and 
money. But it does present a real chance of 
undercoverage. If the foundational class chosen is at 
a time that nursing students are at a practicum, then 
that faculty would be undercovered. This means 
that cluster sampling can result in non- 
representative samples. This is only a good 
technique to use if the clusters are very similar to 
each other and each cluster would be representative 
of the population. 


Cluster vs. stratified 
Cluster sampling and stratified sampling are often 
confused. In each case, the population is divided 
into groups. But, in stratified sampling, a few 
people from all groups (strata) are chosen. While in 
cluster sampling, all of the people from a group 
(cluster) are chosen. 

Additionally, how the groups are chosen is 
different. In stratified sampling, the groups are 
chosen to be heterogeneous (i.e. each group has a 
different quality). As an example, breaking a 
university into different faculties results in groups 
that are heterogeneous as each group has a 
different quality (i.e. faculty) than the other 
groups. On the other hand, in cluster sampling, the 
groups are chosen to be homogeneous (i.e. the 
groups have similar qualities). That is, we want 
each cluster to be similar to the other clusters. 


Systematic random sampling 

To choose a systematic random sample, randomly 
select a starting point and take every kth piece of 
data from a list of the population. For example, to 
choose a random sample of university students, you 
could use a list of all student names that are 
numbered by their student ID. Suppose there are 
14,000 students at the university. To perform 
systematic random sampling, use a random number 
generator to pick a student ID number that 
represents the first name in the sample. Then 
calculate k by dividing the population size (14,000) 
by the size of the sample (100). In this case, 
k = 140. Thus, from your random starting point, choose 
every 140th name thereafter until you have a total 
of 100 names. If you reach the end of the list before 
completing your sample, you simply go back to the 
beginning and keep going until the sample is 
complete. 
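
The procedure can be sketched in Python as follows. The list of IDs numbered 1 to 14,000 and the function name are assumptions for illustration. The population size and sample size are the ones used in the example above.

```python
import random

def systematic_sample(population, n, seed=None):
    """Pick a random starting point, then take every k-th item,
    wrapping around to the start of the list when the end is reached."""
    rng = random.Random(seed)
    size = len(population)
    k = size // n                 # 14,000 // 100 = 140
    start = rng.randrange(size)   # random starting position in the list
    picks = [population[(start + i * k) % size] for i in range(n)]
    return picks, k

# Hypothetical list of 14,000 student IDs.
student_ids = list(range(1, 14_001))

sample, k = systematic_sample(student_ids, 100, seed=5)
print(k)            # 140
print(len(sample))  # 100
print(sample[:3])   # first three selected IDs, 140 apart (unless the list wraps)
```

With k = 140 and 100 selections, the picks cover the whole list exactly once, so no student is selected twice.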


Be careful: k needs to be large enough to ensure that 
you cycle through all the names. Otherwise the 
sample is neither random nor representative. If k 
had been 10, then once the random starting point 
was chosen, only 1,000 names had a chance of being 
chosen, which means that not everyone has an equal 
chance of being chosen. Further, depending on how 
the list is sorted, the sample may not be 
representative. For example, if our list of students 
is sorted by faculty, then only certain faculties 
could make it into our sample. In our example, a k 
of 140 is appropriate because the 100 selections 
then span the whole list. Systematic sampling is 
frequently chosen because it is a simple method that 
can be easily implemented. But like simple random 
sampling, a list of the population is needed to do it 
properly. 


There is a variation of systematic random sampling 
that can be used when the list of the population 
does not exist or is not available to the people doing 
the sampling. For example, suppose you are doing a 
survey about people’s satisfaction with a certain 
mall’s hours. You won’t have a list of all of the 
people who go to the mall. Instead, you may stand 
at an entrance to the mall and ask every fifth person 
who enters the mall to complete your survey. To 
ensure the sampling technique is representative, 
you'll want to do the survey multiple times at 
multiple locations. To ensure that the sampling 
technique is random, you'll want to randomly 
choose your starting times and locations. Having 
said that, this method would never be completely 
representative or random. But it may be your only 
choice if the population is not well defined. 


Randomness and ethics 

When we are performing a study, we cannot force 
people to be part of it. People have a right to say 
no and as researchers we need to seek informed 
consent. That is, the participants should know what 
they are being asked to do, how their information 
will be kept secure, if there are any risks to 
participation (and if so what they are), and how to 
see the results of the study. As such, people can 
choose not to participate in a study. 

Thus, no study involving humans is ever 
completely random or completely representative. 
Our goal when implementing sampling techniques 
is to minimize any bias that may come into the 
study because of this. 


Convenience sampling 

A type of sampling that is non-random is 
convenience sampling. Convenience sampling 
involves using results that are readily available. For 
example, a computer software store conducts a 


marketing study by interviewing potential 
customers who happen to be in the store browsing 
through the available software. The results of 
convenience sampling may be very good in some 
cases and highly biased (favour certain outcomes) in 
others. This is not a valid sampling technique when 
it comes to statistical inference. That is, if the data 
is collected using a convenience sample, then no 
conclusions can be made about the population from 
the sample. 


With replacement or without replacement 

True random sampling is done with replacement. 
That is, once a member is picked, that member goes 
back into the population and thus may be chosen 
more than once. However, for practical reasons, in 
most populations, simple random sampling is done 
without replacement. Surveys are typically done 
without replacement. That is, a member of the 
population may be chosen only once. Most samples 
are taken from large populations and the sample 
tends to be small in comparison to the population. 
Since this is the case, sampling without replacement 
is approximately the same as sampling with 
replacement because the chance of picking the same 
individual more than once with replacement is very 
low. 


To illustrate how small the chance is, consider a 
university with a population of 10,000 people. 
Suppose you want to pick a sample of 1,000 
randomly for a survey. For any particular sample 
of 1,000, if you are sampling with replacement, 

* the chance of picking the first person is 1,000 
out of 10,000 (0.1000); 

* the chance of picking a different second person 
for this sample is 999 out of 10,000 (0.0999); 

* the chance of picking the same person again is 
1 out of 10,000 (very low). 


If you are sampling without replacement, 


* the chance of picking the first person for any 
particular sample is 1,000 out of 10,000 
(0.1000); 

* the chance of picking a different second person 
is 999 out of 9,999 (0.0999); 

* you do not replace the first person before 
picking the next person. 


Compare the fractions 999/10,000 and 999/9,999. 
For accuracy, carry the decimal answers to four 
decimal places. To four decimal places, these 
numbers are equivalent (0.0999). 


Sampling without replacement instead of sampling 
with replacement becomes a mathematical issue 
only when the population is small. For example, if 
the population is 25 people, the sample is ten, and 
you are sampling with replacement for any 
particular sample, then the chance of picking the 
first person is ten out of 25, and the chance of 


picking a different second person is nine out of 25 
(you replace the first person). 


If you sample without replacement, then the 
chance of picking the first person is ten out of 25, 
and then the chance of picking the second person 
(who is different) is nine out of 24 (you do not 
replace the first person). 


Compare the fractions 9/25 and 9/24. To four 
decimal places, 9/25 = 0.3600 and 9/24 = 0.3750. 
To four decimal places, these numbers are not 
equivalent. 
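
The two comparisons above can be reproduced with a few lines of Python. This sketch simply recomputes the fractions from the text. The function name is hypothetical.

```python
from fractions import Fraction

def second_pick_chances(population_size, sample_size):
    """Chance of picking a different second person for a particular sample,
    with replacement and without replacement."""
    with_replacement = Fraction(sample_size - 1, population_size)
    without_replacement = Fraction(sample_size - 1, population_size - 1)
    return with_replacement, without_replacement

for population_size, sample_size in [(10_000, 1_000), (25, 10)]:
    w, wo = second_pick_chances(population_size, sample_size)
    print(population_size, sample_size, f"{float(w):.4f}", f"{float(wo):.4f}")

# 10000 1000 0.0999 0.0999
# 25 10 0.3600 0.3750
```

For the large population the two values agree to four decimal places. For the small population they do not, which is why the distinction matters mainly for small populations.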


A study is done to determine the average 
tuition that San Jose State undergraduate 
students pay per semester. Each student in the 
following samples is asked how much tuition 
he or she paid for the Fall semester. What is 
the type of sampling in each case? 


1. A sample of 100 undergraduate San Jose 
State students is taken by organizing the 
students’ names by classification 
(freshman, sophomore, junior, or senior), 
and then selecting 25 students from each. 

2. A random number generator is used to 
select a student from the alphabetical 


listing of all undergraduate students in 
the Fall semester. Starting with that 
student, every 50th student is chosen 
until 75 students are included in the 
sample. 

3. A completely random method is used to 
select 75 students. Each undergraduate 
student in the fall semester has the same 
probability of being chosen at any stage 
of the sampling process. 

4. The freshman, sophomore, junior, and 
senior years are numbered one, two, 
three, and four, respectively. A random 
number generator is used to pick two of 
those years. All students in those two 
years are in the sample. 

5. An administrative assistant is asked to 
stand in front of the library one 
Wednesday and to ask the first 100 
undergraduate students he encounters 
what they paid for tuition the Fall 
semester. Those 100 students are the 
sample. 


1. stratified; 2. systematic; 3. simple random; 
4. cluster; 5. convenience 


Determine the type of sampling used (simple 
random, stratified, systematic, cluster, or 
convenience). 


1. A soccer coach selects six players from a 
group of boys aged eight to ten, seven 
players from a group of boys aged 11 to 
12, and three players from a group of 
boys aged 13 to 14 to form a recreational 
soccer team. 

2. A pollster interviews all human resource 
personnel in five different high tech 
companies. 

3. A high school educational researcher 
interviews 50 high school female teachers 
and 50 high school male teachers. 

4. A medical researcher interviews every 
third cancer patient from a list of cancer 
patients at a local hospital. 

5. A high school counselor uses a computer 
to generate 50 random numbers and then 
picks students whose names correspond to 
the numbers. 

6. A student interviews classmates in his 
algebra class to determine how many 
pairs of jeans a student owns, on the 
average. 


1. stratified; 2. cluster; 3. stratified; 4. 
systematic; 5. simple random; 6. convenience 


Try It 


Determine the type of sampling used (simple 
random, stratified, systematic, cluster, or 
convenience). 


A high school principal polls 50 freshmen, 50 
sophomores, 50 juniors, and 50 seniors 
regarding policy changes for after school 
activities. 


stratified 


If we were to examine two samples representing the 
same population, even if we used random sampling 
methods for the samples, they would not be exactly 
the same. Just as there is variation in data, there is 
variation in samples. As you become accustomed to 
sampling, the variability will begin to seem natural. 


Suppose ABC College has 10,000 part-time students 
(the population). We are interested in the average 
amount of money a part-time student spends on 
books in the fall term. Asking all 10,000 students is 
an almost impossible task. 


Suppose we take two different samples. 

First, we use convenience sampling and survey ten 
students from a first term organic chemistry class. 
Many of these students are taking first term 
calculus in addition to the organic chemistry class. 
The amount of money they spend on books is as 
follows: 

$128 $87 $173 $116 $130 $204 $147 $189 $93 
$153 

The second sample is taken using a list of senior 
citizens who take P.E. classes and taking every fifth 
senior citizen on the list, for a total of ten senior 
citizens. They spend: 

$50 $40 $36 $15 $50 $100 $40 $53 $22 $22 

It is unlikely that any student is in both samples. 


a. Do you think that either of these samples is 
representative of (or is characteristic of) the 
entire 10,000 part-time student population? 


a. No. The first sample probably consists of 
science-oriented students. Besides the 
chemistry course, some of them are also taking 
first-term calculus. Books for these classes tend 
to be expensive. Most of these students are, 
more than likely, paying more than the 
average part-time student for their books. The 
second sample is a group of senior citizens 
who are, more than likely, taking courses for 
health and interest. The amount of money they 
spend on books is probably much less than the 
average part-time student spends. Both samples are 
biased. Also, in both cases, not all students 
have a chance to be in either sample. 


b. Since these samples are not representative 
of the entire population, is it wise to use the 
results to describe the entire population? 


b. No. For these samples, each member of the 
population did not have an equally likely 
chance of being chosen. 


Now, suppose we take a third sample. We choose 
ten different part-time students from the disciplines 
of chemistry, math, English, psychology, sociology, 
history, nursing, physical education, art, and early 
childhood development. (We assume that these are 
the only disciplines in which part-time students at 
ABC College are enrolled and that an equal number 
of part-time students are enrolled in each of the 
disciplines.) Each student is chosen using simple 
random sampling. Using a calculator, random 
numbers are generated and a student from a 
particular discipline is selected if he or she has a 
corresponding number. The students spend the 
following amounts: 
$180 $50 $150 $85 $260 $75 $180 $200 $200 
$150 


c. Is the sample biased? 


c. The sample is unbiased, but a larger sample 
would be recommended to increase the 
likelihood that the sample will be close to 
representative of the population. However, for 
a biased sampling technique, even a large 
sample runs the risk of not being 
representative of the population. 
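
To make the contrast between the samples concrete, the short Python sketch below computes the mean of the second sample (senior citizens in P.E. classes) and the third sample (one student per discipline), using the dollar amounts listed above. The variable names are illustrative only, and comparing means is just one simple way to see how far apart the two samples are.

```python
from statistics import mean

# Book spending (in dollars) copied from the second and third samples above.
senior_pe_sample = [50, 40, 36, 15, 50, 100, 40, 53, 22, 22]
by_discipline_sample = [180, 50, 150, 85, 260, 75, 180, 200, 200, 150]

senior_mean = mean(senior_pe_sample)
discipline_mean = mean(by_discipline_sample)

print(f"Senior-citizen P.E. sample mean: ${senior_mean:.2f}")
print(f"One-per-discipline sample mean:  ${discipline_mean:.2f}")
print(f"Difference between the means:    ${discipline_mean - senior_mean:.2f}")
```

The two means come out to $42.80 and $153.00. A gap that large between two samples of the same population is a sign that at least one sampling method is not reaching a representative group of students.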


Students often ask if it is "good enough" to take a 
sample, instead of surveying the entire population. 
If the survey is done well, the answer is yes. 


Try It 


A local radio station has a fan base of 20,000 
listeners. The station wants to know if its 
audience would prefer more music or more 
talk shows. Asking all 20,000 listeners is an 
almost impossible task. 


The station uses convenience sampling and 
surveys the first 200 people they meet at one 
of the station’s music concert events. 24 
people said they’d prefer more talk shows, and 
176 people said they’d prefer more music. 


Do you think that this sample is representative 
of (or is characteristic of) the entire 20,000 
listener population? 


Try It Solutions 


The sample probably consists more of people 
who prefer music because it is a concert event. 
Also, the sample represents only those who 
showed up to the event earlier than the 
majority. The sample probably doesn’t 
represent the entire fan base and is probably 
biased towards people who would prefer 
music. 


Variation in Data 


Variation is present in any set of data. For example, 
16-ounce cans of beverage may contain more or less 
than 16 ounces of liquid. In one study, eight 
16-ounce cans were measured and produced the 
following amounts (in ounces) of beverage: 


15.8 16.1 15.2 14.8 15.8 15.9 16.0 15.5 


Measurements of the amount of beverage in a 
16-ounce can may vary because different people make 
the measurements or because the exact amount, 16 
ounces of liquid, was not put into the cans. 
Manufacturers regularly run tests to determine if the 
amount of beverage in a 16-ounce can falls within 
the desired range. 
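
As a rough illustration of how this variation can be summarized, the Python sketch below takes the eight measurements listed above and reports their mean, smallest and largest values, range, and how many cans fall below the labelled 16 ounces. The choice of these particular summaries is an assumption made for illustration; the text does not prescribe a specific check.

```python
from statistics import mean

# The eight measurements (in ounces) from the study described above.
ounces = [15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5]
label_amount = 16.0  # amount printed on the can

sample_mean = mean(ounces)
sample_range = max(ounces) - min(ounces)
below_label = sum(1 for x in ounces if x < label_amount)

print(f"Mean amount:  {sample_mean:.2f} oz")
print(f"Smallest can: {min(ounces):.1f} oz, largest can: {max(ounces):.1f} oz")
print(f"Range:        {sample_range:.1f} oz")
print(f"Cans below {label_amount:.0f} oz: {below_label} of {len(ounces)}")
```

For these data the mean is about 15.64 ounces, the range is about 1.3 ounces, and six of the eight cans hold less than 16 ounces.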


Be aware that as you take data, your data may vary 
somewhat from the data someone else is taking for 
the same purpose. This is completely natural. 
However, if two or more of you are taking the same 
data and get very different results, it is time for you 
and the others to reevaluate your data-taking 
methods and your accuracy. 


Variation in Samples 


It was mentioned previously that two or more 
samples from the same population, taken 
randomly, and having close to the same 
characteristics of the population will likely be 
different from each other. Suppose Doreen and Jung 
both decide to study the average amount of time 
students at their college sleep each night. Doreen 
and Jung each take samples of 500 students. Doreen 
uses systematic sampling and Jung uses cluster 
sampling. Doreen's sample will be different from 
Jung's sample. Even if Doreen and Jung used the 
same sampling method, in all likelihood their 
samples would be different. Neither would be 
wrong, however. 


Think about what contributes to making Doreen’s 
and Jung’s samples different. 


If Doreen and Jung took larger samples (i.e. the 
number of data values is increased), their sample 
results (the average amount of time a student 
sleeps) might be closer to the actual population 
average. But still, their samples would be, in all 
likelihood, different from each other. This 
variability in samples cannot be stressed enough. 
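
The following Python sketch simulates this idea. Everything in it is hypothetical: the population of 20,000 sleep times is generated by the computer (normal, centered at 7 hours), and both "Doreen" and "Jung" use simple random sampling rather than the systematic and cluster methods mentioned above. It is meant only to show that two samples of the same size give different means, and that larger samples tend to land closer to the population mean.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population: nightly sleep (in hours) for 20,000 students.
population = [random.gauss(7.0, 1.2) for _ in range(20_000)]
population_mean = mean(population)

def sample_mean(n):
    """Mean of a simple random sample of n students (without replacement)."""
    return mean(random.sample(population, n))

for n in (50, 500, 5000):
    doreen = sample_mean(n)
    jung = sample_mean(n)
    print(f"n={n:>4}: Doreen {doreen:.3f}, Jung {jung:.3f}, "
          f"population {population_mean:.3f}")
```

Changing the seed changes every sample, but the pattern stays the same: the two sample means differ from each other at every sample size, and they typically sit closer to the population mean as n grows.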


Size of a Sample 


The size of a sample (often called the number of 
observations) is important. The examples you have 
seen in this book so far have been small. Samples of 
only a few hundred observations, or even smaller, 
are sufficient for many purposes. In polling, samples 
that are from 1,200 to 1,500 observations are 
considered large enough and good enough if the 
survey is random and is well done. You will learn 
why when you study confidence intervals. 


Be aware that many large samples are biased. For 
example, call-in surveys are invariably biased, 
because people choose to respond or not. 
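
A small simulation can show why a large self-selected sample may do worse than a much smaller random one. The Python sketch below is built entirely on made-up assumptions: a population of 100,000 people in which 30% hold a "yes" opinion, and a call-in survey in which "yes" holders respond with probability 0.20 and everyone else with probability 0.05.

```python
import random

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical population: 1 = holds the "yes" opinion, 0 = does not.
population = [1] * 30_000 + [0] * 70_000
true_share = sum(population) / len(population)

# A random sample of 1,200 people.
random_sample = random.sample(population, 1_200)
random_estimate = sum(random_sample) / len(random_sample)

# A call-in survey: each person decides whether to respond, and people
# with the "yes" opinion are assumed to be four times as likely to call.
call_in = [x for x in population
           if random.random() < (0.20 if x == 1 else 0.05)]
call_in_estimate = sum(call_in) / len(call_in)

print(f"True share of 'yes':       {true_share:.3f}")
print(f"Random sample (n={len(random_sample)}):    {random_estimate:.3f}")
print(f"Call-in sample (n={len(call_in)}):   {call_in_estimate:.3f}")
```

Under these assumptions the call-in survey collects several times as many responses as the random sample, yet its estimate is far from the true share because the people who chose to respond are not typical of the population.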


Critical Evaluation 


We need to evaluate the statistical studies we read 
about critically and analyze them before accepting 
the results of the studies. We listed common 
problems with sampling techniques above. We 
reiterate them here and add a few more. 


* Problems with samples: A sample must be 
representative of the population. A sample that 
is not representative of the population is 
biased. Biased samples give results 
that are inaccurate and not valid. 

* Self-selected samples: Responses only by people 
who choose to respond, such as call-in surveys, 
are often unreliable. 

* Sample size issues: Samples that are too small 
may be unreliable. Larger samples are better, if 
possible. In some situations, small 
samples are unavoidable and can still be used to 
draw conclusions. Examples: crash testing cars 
or medical testing for rare conditions. 

* Undue influence: Collecting data or asking 
questions in a way that influences the response. 

* Non-response or refusal of subject to 
participate: The collected responses may no 
longer be representative of the population. 
Often, people with strong positive or negative 
opinions may answer surveys, which can affect 
the results. 

* Causality: A relationship between two variables 
does not mean that one causes the other to 
occur. They may be related (correlated) 
because of their relationship through a 
different variable. 

* Self-funded or self-interest studies: A study 
performed by a person or organization in order 
to support their claim. Is the study impartial? 
Read the study carefully to evaluate the work. 
Do not automatically assume that the study is 
good, but do not automatically assume the 
study is bad either. Evaluate it on its merits and 
the work done. 

* Misleading use of data: Improperly displayed 
graphs, incomplete data, or lack of context. 

* Confounding: When the effects of multiple 
factors on a response cannot be separated. 
Confounding makes it difficult or impossible to 
draw valid conclusions about the effect of each 
factor. 


Chapter Review 


Data are individual items of information that come 
from a population or sample. Data may be classified 
as categorical nominal, categorical ordinal, 
quantitative continuous, or quantitative discrete. 


Because it is not practical to measure the entire 
population in a study, researchers use samples to 
represent the population. A random sample is a 
representative group from the population chosen by 
using a method that gives each individual in the 
population an equal chance of being included in the 
sample. Random sampling methods include simple 
random sampling, stratified sampling, cluster 
sampling, and systematic sampling. Convenience 
sampling is a nonrandom method of choosing a 
sample that often produces biased data. 


Samples that contain different individuals result in 
different data. This is true even when the samples 
are well-chosen and representative of the 
population. When properly selected, larger samples 
model the population more closely than smaller 
samples. There are many different potential 
problems that can affect the reliability of a sample. 
Statistical data need to be critically analyzed, not 
simply accepted. 


HOMEWORK 


For the following exercises, identify the type of data 
that would be used to describe a response (quantitative 
discrete, quantitative continuous, or categorical), and 
give an example of the data. 


number of tickets sold to a concert 


quantitative discrete, 150 


percent of body fat 


quantitative continuous, 19.2% 


favorite baseball team 


categorical, Oakland A’s 


time in line to buy groceries 


quantitative continuous, 7.2 minutes 


number of students enrolled at Evergreen 
Valley College 


quantitative discrete, 11,234 students 


most-watched television show 


categorical, Dancing with the Stars 


brand of toothpaste 


categorical, Crest 


distance to the closest movie theatre 


quantitative continuous, 8.32 miles 


age of executives in Fortune 500 companies 


quantitative continuous, 47.3 years 


Use the following information to answer the next two 
exercises: A study was done to determine the age, 
number of times per week, and the duration 
(amount of time) of resident use of a local park in 
Vancouver. The first house in the neighbourhood 
around the park was selected randomly and then 
every 8th house in the neighbourhood around the 
park was interviewed. 


“Number of times per week” is what type of 
data? 


1. nominal categorical ordinal 

2. quantitative discrete 

3. quantitative continuous 

4. categorical nominal 

5. categorical ordinal 


“Duration (amount of time)” is what type of 
data? 


1. categorical discrete 

2. quantitative discrete 

3. quantitative continuous 
4. categorical nominal 


5. categorical ordinal 


Airline companies are interested in the 
consistency of the number of babies on each 
flight, so that they have adequate safety 
equipment. Suppose an airline conducts a 
survey. Over Thanksgiving weekend, it surveys 
six flights from Montreal to Halifax to 
determine the number of babies on the flights. 
It determines the amount of safety equipment 
needed by the result of that study. 


1. Using complete sentences, list three things 
wrong with the way the survey was 
conducted. 

2. Using complete sentences, list three ways 
that you would improve the survey if it 
were to be repeated. 


1. The survey was conducted using six similar 
flights. 
The survey would not be a true 
representation of the entire population of 
air travelers. 
Conducting the survey on a holiday 
weekend will not produce representative 
results. 


2. Conduct the survey during different times 
of the year. 
Conduct the survey using flights to and 
from various locations. 
Conduct the survey on different days of the 
week. 


Suppose you want to determine the mean 
number of cans of soda drunk each month by 
students in their twenties at your school. 
Describe a possible sampling method in three to 
five complete sentences. Make the description 
detailed. 


Answers will vary. Sample Answer: You could 
use a systematic sampling method. Stop the 
tenth person as they leave one of the buildings 
on campus at 9:50 in the morning. Then stop 
the tenth person as they leave a different 
building on campus at 1:50 in the afternoon. 


Name the sampling method used in each of the 
following situations: 


1. A woman in the airport is handing out 
questionnaires to travelers asking them to 
evaluate the airport’s service. She does not 
ask travelers who are hurrying through the 
airport with their hands full of luggage, 
but instead asks all travelers who are 
sitting near gates and not taking naps 
while they wait. 

2. A teacher wants to know if her students 
are doing homework, so she randomly 
selects rows two and five and then calls on 
all students in row two and all students in 
row five to present the solutions to 
homework problems to the class. 

3. The marketing manager for an electronics 
chain store wants information about the 
ages of its customers. Over the next two 
weeks, at each store location, 100 
randomly selected customers are given 
questionnaires to fill out asking for 
information about age, as well as about 
other variables of interest. 

4. The librarian at a public library wants to 
determine what proportion of the library 
users are children. The librarian has a tally 
sheet on which she marks whether books 
are checked out by an adult or a child. She 
records this data for every fourth patron 
who checks out books. 

5. A political party wants to know the 
reaction of voters to a debate between the 
candidates. The day after the debate, the 
party’s polling staff calls 1,200 randomly 
selected phone numbers. If a registered 
voter answers the phone or is available to 
come to the phone, that registered voter is 
asked whom he or she intends to vote for 
and whether the debate changed his or her 
opinion of the candidates. 


1. convenience; 2. cluster; 3. stratified; 4. systematic; 
5. simple random 


In advance of the 1936 Presidential Election, a 
magazine titled Literary Digest released the 
results of an opinion poll predicting that the 
Republican candidate Alf Landon would win by 
a large margin. The magazine sent postcards to 
approximately 10,000,000 prospective voters. 
These prospective voters were selected from the 
subscription list of the magazine, from 
automobile registration lists, from phone lists, 
and from club membership lists. Approximately 
2,300,000 people returned the postcards. 


1. Think about the state of the United States 
in 1936. Explain why a sample chosen 
from magazine subscription lists, 
automobile registration lists, phone books, 
and club membership lists was not 
representative of the population of the 
United States at that time. 

2. What effect does the low response rate 
have on the reliability of the sample? 

3. Are these problems examples of sampling 
error or nonsampling error? 

4. During the same year, George Gallup 
conducted his own poll of 30,000 
prospective voters. His researchers used a 
method they called "quota sampling" to 
obtain survey answers from specific 
subsets of the population. Quota sampling 
is an example of which sampling method 
described in this module? 


1. The country was in the middle of the Great 
Depression, and many people could not 
afford these “luxury” items and therefore 
could not be included in the survey. 

2. Samples that are too small can lead to 
sampling bias. 

3. sampling error 

4. stratified 


YouPolls is a website that allows anyone to 
create and respond to polls. One question 
posted April 15 asks: 


“Do you feel happy paying your taxes when 
members of the Obama administration are 
allowed to ignore their tax liabilities?” [footnote] 

lastbaldeagle. 2013. On Tax Day, House to Call 
for Firing Federal Workers Who Owe Back 
Taxes. Opinion poll posted online at: 
http://www.youpolls.com/details.aspx?id=12328 
(accessed May 1, 2013). 


As of April 25, 11 people responded to this 
question. Each participant answered “NO!” 


Which of the potential problems with samples 
discussed in this module could explain this 
connection? 


Self-Selected Samples: Only people who are 
interested in the topic are choosing to respond. 
Sample Size Issues: A sample with only 11 
participants will not accurately represent the 
opinions of a nation. 


Undue Influence: The question is worded in a 
specific way to generate a specific response. 
Self-Funded or Self-Interest Studies: This 
question was generated to support one person’s 
claim and it was designed to get the answer 
that the person desires. 


Glossary 


Cluster Sampling 
a method for selecting a random sample; 
divide the population into groups (clusters). 
Use simple random sampling to select a set of 
clusters. Every individual in the chosen 
clusters is included in the sample. 


Continuous Random Variable 
a random variable (RV) whose outcomes are 
measured; the height of trees in the forest is a 
continuous RV. 


Convenience Sampling 
a nonrandom method of selecting a sample; 
this method selects individuals that are easily 
accessible and may result in biased data. 


Discrete Random Variable 
a random variable (RV) whose outcomes are 
counted 


Nonsampling Error 
an issue that affects the reliability of sampling 
data other than natural variation; it includes a 
variety of human errors including poor study 
design, biased sampling methods, inaccurate 
information provided by study participants, 
data entry errors, and poor analysis. 


Qualitative Data 
See Data. 


Quantitative Data 
See Data. 


Random Sampling 
a method of selecting a sample that gives 
every member of the population an equal 
chance of being selected. 


Sampling Bias 
not all members of the population are equally 
likely to be selected 


Sampling Error 
the natural variation that results from 
selecting a sample to represent a larger 
population; this variation decreases as the 
sample size increases, so selecting larger 
samples reduces sampling error. 


Sampling with Replacement 
Once a member of the population is selected 
for inclusion in a sample, that member is 
returned to the population for the selection of 
the next individual. 


Sampling without Replacement 
A member of the population may be chosen 
for inclusion in a sample only once. If chosen, 
the member is not returned to the population 
before the next selection. 


Simple Random Sampling 
a straightforward method for selecting a 
random sample; give each member of the 
population a number. Use a random number 
generator to select a set of labels. These 
randomly selected labels identify the 
members of your sample. 


Stratified Sampling 
a method for selecting a random sample used 
to ensure that subgroups of the population are 
represented adequately; divide the population 
into groups (strata). Use simple random 
sampling to identify a proportionate number 
of individuals from each stratum. 


Systematic Sampling 
a method for selecting a random sample; list 
the members of the population. Use simple 
random sampling to select a starting point in 
the population. Let k = (number of 
individuals in the population)/(number of 
individuals needed in the sample). Choose 
every kth individual in the list starting with 
the one that was randomly selected. If 
necessary, return to the beginning of the 
population list to complete your sample. 
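
The four random sampling methods defined in this glossary can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a standard implementation: the roster of 40 students labelled 1 to 40, the `year_of` lookup that assigns ten students to each of four class years, and all function names are hypothetical. Because the four strata are the same size, taking the same number from each stratum is proportionate.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical roster: students labelled 1..40, ten in each of four class years.
population = list(range(1, 41))
year_of = {s: (s - 1) // 10 + 1 for s in population}  # class year 1-4

def simple_random(pop, n):
    """Pick n labels at random, without replacement."""
    return random.sample(pop, n)

def stratified(pop, stratum_of, per_stratum):
    """Simple random sample of per_stratum members from every stratum."""
    strata = {}
    for member in pop:
        strata.setdefault(stratum_of[member], []).append(member)
    chosen = []
    for members in strata.values():
        chosen.extend(random.sample(members, per_stratum))
    return chosen

def cluster(pop, cluster_of, n_clusters):
    """Randomly pick whole clusters; keep every member of the chosen clusters."""
    picked = set(random.sample(sorted(set(cluster_of.values())), n_clusters))
    return [m for m in pop if cluster_of[m] in picked]

def systematic(pop, n):
    """Random start, then every kth member, wrapping around the list."""
    k = len(pop) // n  # k = population size / sample size
    start = random.randrange(len(pop))
    return [pop[(start + i * k) % len(pop)] for i in range(n)]

print("simple random:", sorted(simple_random(population, 8)))
print("stratified:   ", sorted(stratified(population, year_of, 2)))
print("cluster:      ", sorted(cluster(population, year_of, 2)))
print("systematic:   ", sorted(systematic(population, 8)))
```

With 40 students and a sample of 8, the systematic sample uses k = 5, so its sorted labels are always five apart. The cluster sample always contains all 20 students from two class years, while the stratified sample contains two students from each of the four years.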


Definitions of Statistics, Probability, and Key Terms 
-- MRU - C Lemieux (2017) 


The science of statistics deals with the collection, 
analysis, interpretation, and presentation of data. 


The process of statistical analysis follows these 
broad steps. 


1. Defining the problem 

2. Planning the study 

3. Collecting the data for the study 

4. Analysis of the data 

5. Interpretations and conclusions based on the 
analysis 


For example, we may wonder if there is a gap 
between how much men and women are paid for 
doing the same job. This would be the problem we 
want to investigate. Before we do the investigation, 
we would want to spend some time defining the 
problem. This could include defining terms (e.g. 
what do we mean by “paid”? what constitutes the 
“same job”?). Then we would want to state a 
research question. A research question is the 
overarching question that the study aims to address. 
In this example, our research question might be: 
“Does the gender wage gap exist?” 


Once we have the problem clearly defined, we need 
to figure out how we are going to study the 
problem. This would include determining how we 
are going to collect the data for the study. Since it is 
unlikely we are going to find out the salary and 
position of every employee in the world (i.e. the 
population), we need to instead collect data from a 
subset of the whole (i.e. a sample). The process of 
how we will collect the data is called the sampling 
technique. The overall plan of how the study is 
designed is called the sampling design or 
methodology. 


Once we have the methodology, we want to 
implement it and collect the actual data. 


Once we have the data, we organize and 
summarize them. Organizing and 
summarizing data is called descriptive statistics. 
Two ways to summarize data are by visually 
summarizing the data (for example, a histogram) 
and by numerically summarizing the data (for 
example, the average). After we have summarized 
the data, we will use formal methods for drawing 
conclusions from "good" data. The formal methods 
are called inferential statistics. Statistical inference 
uses probability to determine how confident we can 
be that our conclusions are correct. 


Once we have summarized and analyzed the data, 
we want to see what kind of conclusions we can 
draw. This would include attempting to answer the 
research question and recognizing the limitations of 
the conclusions. 


In this course, most of our time will be spent in the 
last two steps of the statistical analysis process (i.e. 
organizing, summarizing and analyzing data). To 
understand the process of making inferences from 
the data, we must also learn about probability. This 
will help us understand the likelihood of random 
events occurring. 


Key Idea and Terms 


In statistics, we generally want to study a 
population. You can think of a population as a 
collection of persons, things, or objects under study. 
The person, thing, or object under study (i.e. the 
object of study) 
is called the observational unit. What we are 
measuring or observing about the observational unit 
is called the variable. We often use the letters X or 
Y to represent a variable. A specific instance of a 
variable is called data. 


Suppose our research question is “Do current NHL 
forwards who make over $3 million a year score, 
on average, more than 20 points a season?” 

The population would be all of the NHL forwards 
who make over $3 million a year and who are 
currently playing in the NHL. The observational 
unit is a single member of the population, which 
would be any forward that made over $3 million 
a year. The variable is what we are studying about 
the observational unit, which is the number of points a 
forward in the population gets in a season. A data 
value would be the actual number of points. 


In the above example, it would be reasonable to 
look at the population when doing the statistical 
analysis as the population is very well defined, there 
are many websites that have this information 
readily available, and the population size is 
relatively small. But this is not always the case. For 
example, suppose you want to study the average 
profits of oil and gas companies in the world. It 
might be very hard to get a list of all of the oil and 
gas companies in the world and get access to their 
financial reports. When the population is not easily 
accessible, we instead look at a sample. The idea of 
sampling (the process of collecting the sample) is to 
select a portion (or subset) of the larger population 
and study that portion (the sample) to gain 
information about the population. 


Because it takes a lot of time and money to examine 
an entire population, sampling is a very practical 
technique. If you wished to compute the overall 
grade point average at your school, it would make 
sense to select a sample of students who attend the 
school. The data collected from the sample would be 
the students' grade point averages. In federal 
elections, opinion poll samples of 1,000 to 2,000 
people are taken. The opinion poll is supposed to 
represent the views of the people in the entire 
country. Manufacturers of canned carbonated drinks 
take samples to determine if a 16-ounce can 
contains 16 ounces of carbonated drink. 


It is important to note that though we might not 
know the population, when we decide to sample 
from it, it is fairly static. Going back to the example 
of the NHL forwards, if we were to gather the data 
for the population right now that would be our fixed 
population. But if you took a sample from that 
population and your friend took a sample from that 
population, it is not surprising that you and your 
friend would get different samples. That is, there is 
one population, but there are many, many different 
samples that can be drawn from the population. How 
the samples vary from each other is called sampling 
variability. The idea of sampling variability is a key 
concept in statistics and we will come back to it 
over and over again. 


Data is plural. Datum is singular. 


As mentioned above, a variable, or random variable, 
notated by capital letters such as X and Y, is a 
characteristic of interest for each person or thing in 
a population. Data are the actual values of the 
variable. Data and variables fall into two general 
types: either they measure or count something, or 
they do not. When a variable is measuring or 
counting something, it is called a quantitative 
variable and the data are called quantitative data. 
When a variable is not measuring or counting 
something, it is called a categorical variable and 
the data are called categorical data. For a variable to be 
considered quantitative, the distance between each 
number has to be fixed. In general, quantitative 
variables measure something and take on values 
with equal units such as weight in pounds or 
number of people in a line. Categorical variables 
place the person or thing into a category such as 
colour of car or opinion on topic. 


* In the NHL forwards example, the variable is 
quantitative as we are investigating the number of 
points a player has. 

* In the gender gap example, there were three 
variables: the salary, gender, and the position. 
The salary is a quantitative variable as we are 
investigating the amount people make. Gender 
is a categorical variable as we are categorizing 
someone’s gender. Position is also categorical 
as we are categorizing their type of 
employment. 

* Sometimes, though, determining the type of a 
variable (i.e. quantitative or categorical) is not 
cut and dried. In particular, Likert 
scales or rating scales are tricky to place. A 
Likert scale is any scale where you are asked 
to state your opinion on a scale. For example, 
you may be asked whether you strongly agree, 
agree, are neutral, disagree, or strongly disagree 
with a statement. Sometimes there is a 
number associated with the rating. For 
example, write 5 if you strongly agree and 1 if 
you strongly disagree. Technically, Likert-scale 
data are categorical, as we are 
categorizing people’s opinions and the number 
is just a short form for the category. 


When you are asked to categorize the data or 
variable, first determine what the observational unit 
is. Then determine the variable being studied. 
Then think about what the data will look like. If 
a data value is a number, then it is usually 
quantitative data (be wary of Likert scales). If a 
data value is a word or category, then it is categorical 
data. 


For the following research questions, state the 
observational unit, the variable being studied, 
and the type of variable. 


1. What is the average monthly temperature 
in Edmonton? 

2. What is the highest belt colour that most 
students of karate earn in Canada? 

3. What is the average weight of greyhound 
dogs? 

4. What is the average gross profit of movies 
made in 2016? 

5. What is the average user rating of Jessica 
Jones season 1 on IMDB? 

6. What is the most common colour of car in 
Nova Scotia? 


1. Observational unit: Edmonton. Variable: 
Monthly Temperature. Type: Quantitative. 

2. Observational unit: Student of karate in 
Canada. Variable: Highest colour of belt 
earned. Type: Categorical. 

3. Observational unit: Greyhounds. Variable: 
Weight. Type: Quantitative. 

4. Observational unit: Movies made in 2016. 
Variable: Gross profit. Type: Quantitative. 

5. Observational unit: Jessica Jones. Variable: 
User ratings. Type: Categorical. 

6. Observational unit: Cars in Nova Scotia. 
Variable: Colour. Type: Categorical. 


Two words that come up often in statistics are mean 
and proportion. These are two examples of 
numerical descriptive statistics. If you were to take 
three exams in your math classes and obtain scores 
of 86, 75, and 92, you would calculate your mean 
score by adding the three exam scores and dividing 
by three (your mean score would be 84.3 to one 
decimal place). If, in your math class, there are 40 
students and 22 are men then the proportion of men 
in the course is 55% and the proportion of women is 
45%. 
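
The arithmetic in this paragraph can be checked with a few lines of Python. The variable names are illustrative, and the proportion of women is computed as the rest of the class, following the paragraph above.

```python
# Exam scores and class counts taken from the paragraph above.
exam_scores = [86, 75, 92]
mean_score = sum(exam_scores) / len(exam_scores)

class_size = 40
men = 22
proportion_men = men / class_size
proportion_women = (class_size - men) / class_size

print(f"Mean exam score:     {mean_score:.1f}")
print(f"Proportion of men:   {proportion_men:.0%}")
print(f"Proportion of women: {proportion_women:.0%}")
```

The mean is 253 / 3, which rounds to 84.3, and the two proportions are 22 / 40 = 55% and 18 / 40 = 45%.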


From the sample data, we can calculate a statistic. A 
statistic is a numerical summary that represents a 
property of the sample. For example, if we consider 
one math class to be a sample of the population of 
all math classes, then the mean number of points 
earned by students in that one math class at the end 
of the term is an example of a statistic. The statistic 
is an estimate of a population parameter, in this 
case the mean. A parameter is a numerical 
summary that represents a property of the 
population. Since we considered all math classes to 
be the population, then the mean number of points 
earned per student over all the math classes is an 
example of a parameter (i.e. the population mean). 
If we took a sample of students from the math class 
and found the mean points earned per student in the 
sample, then we would have found a statistic (i.e. 
the sample mean). 


In the NHL example, a sample of the population 
may be 31 forwards who make over $3 million per 
year. The sample was chosen by randomly 
choosing one forward who makes over $3 million 
from each team (if you are reading this after Sept. 
2021, this would be changed to 32). The process of 
choosing the sample is called sampling. We would 
then collect the data for the sample, which would 
be the number of points each player in our sample 
gets in one season. The statistic would be the mean 
of the total number of points for the sample. The 
parameter at this point would be unknown, but we 
could estimate it with our statistic. To find the 
parameter, we would have to find the mean of the 
total number of points for the population. 


One of the main concerns in the field of statistics is 
how accurately a statistic estimates a parameter. 
The accuracy really depends on how well the 
sample represents the population. The sample must 
contain the characteristics of the population in order 
to be a representative sample. We are interested in 
both the sample statistic and the population 
parameter in inferential statistics. In a later chapter,
we will use the sample statistic to test the validity of
the established population parameter.


Determine what the key terms refer to in the
following study. We want to know what
proportion of first-year students get to ABC
college using public transit. We randomly
survey 100 first-year students at ABC college.


The population is all first-year students
attending ABC college this term.


The sample depends on how we choose the 
students. One possible answer could be all 
students enrolled in one section of a beginning 
statistics course at ABC College (although this 
sample would not be deemed random nor 
representative of the entire population). 


The variable would be whether a first-year 
student uses public transportation to get to 
ABC college or not. 


The data are the actual values of the variable.
As students would either use public
transportation or not, the data would be "yes"
or "no", or "public transportation" or "not public
transportation" (depending on how you chose
to represent your data).


The statistic is the proportion of students in 
your SAMPLE who use public transportation to 
get to ABC college. (Note: The mean would not 
be an appropriate summary here as you cannot 
find the mean of categorical data). 


The parameter is the proportion of ALL first- 
year students who use public transportation to 
get to ABC college. 
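The difference between the statistic and the parameter in this example can be sketched in code. Everything numeric here is hypothetical: the population size of 2,000 and the 700 transit users are made up for illustration, since in a real study the parameter would be unknown.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 2,000 first-year students, 700 of whom use transit.
population = ["yes"] * 700 + ["no"] * 1300
parameter = population.count("yes") / len(population)  # normally unknown

# Simple random sample of 100 students, drawn without replacement.
sample = random.sample(population, 100)
statistic = sample.count("yes") / len(sample)

print(f"parameter (all first-year students): {parameter:.2f}")
print(f"statistic (sample of 100):           {statistic:.2f}")
```

The statistic will usually land near the parameter without matching it exactly, which is why it is described as an estimate.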


Try It 


Determine what the key terms refer to in the
following study. We want to know the average
(mean) amount of money spent on school
uniforms each year by families with children
at Knoll Academy. We randomly survey 100
families with children in the school. Three of
the families spent $65, $75, and $95,
respectively.


Try It Solutions 


The population is all families with children 
attending Knoll Academy. 


The sample is a random selection of 100 
families with children attending Knoll 
Academy. 


The parameter is the average (mean) amount 
of money spent on school uniforms by families 
with children at Knoll Academy. 


The statistic is the average (mean) amount of 
money spent on school uniforms by families in 
the sample. 


The variable is the amount of money spent by 
one family. Let X = the amount of money 
spent on school uniforms by one family with 
children attending Knoll Academy. 


The data are the dollar amounts spent by the 
families. Examples of the data are $65, $75, 
and $95. 


Determine what the key terms refer to in the
following study.


A study was conducted at a local college to
analyze the average cumulative GPAs of
students who graduated last year. Fill in the
letter of the phrase that best describes each of
the items below.


1.____ Population 2.___ Statistic 3.___ 
Parameter 4.___ Sample 5.___ Variable 6.____ 
Data 


* a) all students who attended the college 
last year 

* b) the cumulative GPA of one student 
who graduated from the college last year 

* c) 3.65, 2.80, 1.50, 3.90

* d) a group of students who graduated 
from the college last year, randomly 
selected 

* e) the average cumulative GPA of 
students who graduated from the college 
last year 

* f) all students who graduated from the 
college last year 

* g) the average cumulative GPA of 
students in the study who graduated from 
the college last year 


1. f; 2. g; 3. e; 4. d; 5. b; 6. c


Determine what the key terms refer to in the
following study.


As part of a study designed to test the safety of 
automobiles, the National Transportation 
Safety Board collected and reviewed data 
about the effects of an automobile crash on 
test dummies. Here is the criterion they used: 


Speed at which cars crashed    Location of "drive" (i.e. dummies)
35 miles/hour                  Front seat


Cars with dummies in the front seats were 
crashed into a wall at a speed of 35 miles per 
hour. We want to know the proportion of 
dummies in the driver’s seat that would have 
had head injuries, if they had been actual 
drivers. We start with a simple random sample 
of 75 cars. 


The population is all cars containing dummies 
in the front seat. 


The sample is the 75 cars, selected by a simple 
random sample. 


The parameter is the proportion of driver 
dummies (if they had been real people) who 
would have suffered head injuries in the 
population. 


The statistic is the proportion of driver dummies
(if they had been real people) who would have 
suffered head injuries in the sample. 


The variable X = the number of driver 
dummies (if they had been real people) who 
would have suffered head injuries. 


The data are either: yes, had head injury, or 
no, did not. 


Determine what the key terms refer to in the 
following study. 


An insurance company would like to 
determine the proportion of all medical 
doctors who have been involved in one or 
more malpractice lawsuits. The company 
selects 500 doctors at random from a 
professional directory and determines the 
number in the sample who have been involved 
in a malpractice lawsuit. 


The population is all medical doctors listed in 
the professional directory. 


The parameter is the proportion of medical 
doctors who have been involved in one or 
more malpractice suits in the population. 


The sample is the 500 doctors selected at 
random from the professional directory. 


The statistic is the proportion of medical 
doctors who have been involved in one or 
more malpractice suits in the sample. 


The variable X = the number of medical 
doctors who have been involved in one or 
more malpractice suits. 


The data are either: yes, was involved in one 
or more malpractice lawsuits, or no, was not. 
Chapter Review


The mathematical theory of statistics is easier to 
learn when you know the language. This module 
presents important terms that will be used 
throughout the text. 


HOMEWORK 


For each of the following eight exercises, identify: a. the 
population, b. the sample, c. the parameter, d. the 
statistic, e. the variable, and f. the data. Give examples 
where appropriate. 


A fitness center is interested in the mean 
amount of time a client exercises in the center 
each week. 


Ski resorts are interested in the mean age that 
children take their first ski and snowboard 
lessons. They need this information to plan 
their ski classes optimally. 


1. all children who take ski or snowboard 
lessons 


2. a group of these children 

3. the population mean age of children who 
take their first snowboard lesson 

4. the sample mean age of children who take 
their first snowboard lesson 

5. X = the age of one child who takes his or 
her first ski or snowboard lesson 

6. values for X, such as 3, 7, and so on 


A cardiologist is interested in the mean 
recovery period of her patients who have had 
heart attacks. 


Insurance companies are interested in the mean 
health costs each year of their clients, so that 
they can determine the costs of health 
insurance. 


1. the clients of the insurance companies 

2. a group of the clients 

3. the mean health costs of the clients 

4. the mean health costs of the sample 

5. X = the health costs of one client 

6. values for X, such as 34, 9, 82, and so on 


A politician is interested in the proportion of
voters in his district who think he is doing a
good job.


A marriage counselor is interested in the 
proportion of clients she counsels who stay 
married. 


1. all the clients of this counselor

2. a group of clients of this marriage 
counselor 

3. the proportion of all her clients who stay 
married 

4. the proportion of the sample of the 
counselor’s clients who stay married 

5. X = the number of couples who stay 
married 

6. yes, no 


Political pollsters may be interested in the 
proportion of people who will vote for a 
particular cause. 


A marketing company is interested in the 
proportion of people who will buy a particular 
product. 


1. all people (maybe in a certain geographic 
area, such as the United States) 

2. a group of the people 

3. the proportion of all people who will buy 
the product 

4. the proportion of the sample who will buy 
the product 

5. X = the number of people who will buy it 

6. buy, not buy 


Use the following information to answer the next three 
exercises: A Lake Tahoe Community College 
instructor is interested in the mean number of days 
Lake Tahoe Community College math students are 
absent from class during a quarter. 


What is the population she is interested in? 


1. all Lake Tahoe Community College 
students 

2. all Lake Tahoe Community College English 
students 

3. all Lake Tahoe Community College 
students in her classes 

4. all Lake Tahoe Community College math 
students 


Consider the following: 


X = number of days a Lake Tahoe Community 
College math student is absent 


In this case, X is an example of a: 


1. variable. 

2. population. 
3. statistic. 

4. data. 


The instructor’s sample produces a mean 
number of days absent of 3.5 days. This value is 
an example of a: 


1. parameter. 
2. data. 

3. statistic. 
4. variable. 


Glossary 


Average
also called mean or arithmetic mean; a
number that describes the central tendency of
the data


Categorical Variable
variables that take on values that are names
or labels


Data
a set of observations (a set of possible
outcomes); most data used in statistical
research can be put into two groups:
categorical (an attribute whose value is a
label) or quantitative (an attribute whose
value is indicated by a number). Categorical
data can be separated into two subgroups:
nominal and ordinal. Data is nominal if it
cannot be meaningfully ordered. Data is
ordinal if the data can be meaningfully
ordered. Quantitative data can be separated
into two subgroups: discrete and
continuous. Data is discrete if it is the result
of counting (such as the number of students
of a given ethnic group in a class or the
number of books on a shelf). Data is
continuous if it is the result of measuring
(such as distance traveled or weight of
luggage)


Numerical Variable 


variables that take on values that are 
indicated by numbers 


Parameter 


a number that is used to represent a 
population characteristic and that generally 
cannot be determined easily 


Population 
all individuals, objects, or measurements 
whose properties are being studied 


Probability 
a number between zero and one, inclusive, 
that gives the likelihood that a specific event 
will occur 


Proportion 
the number of successes divided by the total 
number in the sample 


Representative Sample 
a subset of the population that has the same 
characteristics as the population 


Sample 
a subset of the population studied 


Statistic 
a numerical characteristic of the sample; a 
statistic estimates the corresponding 
population parameter. 


Variable 
a characteristic of interest for each person or 
object in a population 


Experimental Design and Ethics -- MtRoyal - 
Version2016RevA 


Does aspirin reduce the risk of heart attacks? Is one 
brand of fertilizer more effective at growing roses 
than another? Is fatigue as dangerous to a driver as 
the influence of alcohol? Questions like these are 
answered using randomized experiments. In this 
module, you will learn important aspects of 
experimental design. Proper study design ensures 
the production of reliable, accurate data. 


The purpose of an experiment is to investigate the 
relationship between two variables. When one 
variable causes change in another, we call the first 
variable the independent variable or explanatory 
variable. The affected variable is called the 
dependent variable or response variable. In a 
randomized experiment, the researcher manipulates 
values of the explanatory variable and measures the 
resulting changes in the response variable. The 
different values of the explanatory variable are 
called treatments. An experimental unit is a single 
object or individual to be measured. 


You want to investigate the effectiveness of vitamin 
E in preventing disease. You recruit a group of 
subjects and ask them if they regularly take vitamin 
E. You notice that the subjects who take vitamin E 
exhibit better health on average than those who do 
not. Does this prove that vitamin E is effective in
disease prevention? It does not. There are many
differences between the two groups compared in 
addition to vitamin E consumption. People who take 
vitamin E regularly often take other steps to 
improve their health: exercise, diet, other vitamin 
supplements, choosing not to smoke. Any one of 
these factors could be influencing health. As 
described, this study does not prove that vitamin E 
is the key to disease prevention. 


Additional variables that can cloud a study are 
called lurking variables. In order to prove that the 
explanatory variable is causing a change in the 
response variable, it is necessary to isolate the 
explanatory variable. The researcher must design 
her experiment in such a way that there is only one 
difference between groups being compared: the 
planned treatments. This is accomplished by the 
random assignment of experimental units to 
treatment groups. When subjects are assigned 
treatments randomly, all of the potential lurking 
variables are spread equally among the groups. At 
this point the only difference between groups is the 
one imposed by the researcher. Different outcomes 
measured in the response variable, therefore, must 
be a direct result of the different treatments. In this 
way, an experiment can prove a cause-and-effect 
connection between the explanatory and response 
variables. 
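Random assignment as described above can be sketched in a few lines. The function name assign_to_groups, the subject IDs, and the two treatment labels are all hypothetical; the sketch shuffles the experimental units and then deals them out to the treatment groups in turn.

```python
import random

def assign_to_groups(units, treatments, seed=None):
    """Shuffle the units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical subject IDs for a vitamin E study with a placebo group.
subjects = [f"subject_{i:02d}" for i in range(1, 21)]
groups = assign_to_groups(subjects, ["vitamin E", "placebo"], seed=42)

for treatment, members in groups.items():
    print(treatment, len(members), members)
```

Because chance alone decides who lands in which group, habits such as exercise or smoking are not tied to the treatment a subject receives.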


The power of suggestion can have an important
influence on the outcome of an experiment. Studies
have shown that the expectation of the study
participant can be as important as the actual
medication. In one study of performance-enhancing
drugs, researchers noted:


Results showed that believing one had taken the 
substance resulted in [performance] times almost as 
fast as those associated with consuming the drug itself. 
In contrast, taking the drug without knowledge yielded 
no significant performance increment. [footnote] 
McClung, M. Collins, D. “Because I know it will!”: 
placebo effects of an ergogenic aid on athletic 
performance. Journal of Sport & Exercise 
Psychology. 2007 Jun. 29(3):382-94. Web. April 30, 
2013. 


When participation in a study prompts a physical 
response from a participant, it is difficult to isolate 
the effects of the explanatory variable. To counter 
the power of suggestion, researchers set aside one 
treatment group as a control group. This group is 
given a placebo treatment—a treatment that cannot 
influence the response variable. The control group 
helps researchers balance the effects of being in an 
experiment with the effects of the active treatments. 
Of course, if you are participating in a study and 
you know that you are receiving a pill which 
contains no actual medication, then the power of 
suggestion is no longer a factor. Blinding in a
randomized experiment preserves the power of
suggestion. When a person involved in a research
study is blinded, he does not know who is receiving 
the active treatment(s) and who is receiving the 
placebo treatment. A double-blind experiment is 
one in which both the subjects and the researchers 
involved with the subjects are blinded. 


The Smell & Taste Treatment and Research 
Foundation conducted a study to investigate 
whether smell can affect learning. Subjects 
completed mazes multiple times while wearing 
masks. They completed the pencil and paper 
mazes three times wearing floral-scented 
masks, and three times with unscented masks. 
Participants were assigned at random to wear 
the floral mask during the first three trials or 
during the last three trials. For each trial, 
researchers recorded the time it took to 
complete the maze and the subject’s 
impression of the mask’s scent: positive, 
negative, or neutral. 


1. Describe the explanatory and response 
variables in this study. 

2. What are the treatments? 

3. Identify any lurking variables that could 
interfere with this study. 

4. Is it possible to use blinding in this study? 


1. The explanatory variable is scent, and the
response variable is the time it takes to
complete the maze.

2. There are two treatments: a floral-scented
mask and an unscented mask.

3. All subjects experienced both treatments.
The order of treatments was randomly
assigned so there were no differences
between the treatment groups. Random
assignment eliminates the problem of
lurking variables.

4. Subjects will clearly know whether they
can smell flowers or not, so subjects
cannot be blinded in this study.
Researchers timing the mazes can be
blinded, though. The researcher who is
observing a subject will not know which
mask is being worn.
Chapter Review


A poorly designed study will not produce reliable 
data. There are certain key components that must be 
included in every experiment. To eliminate lurking 
variables, subjects must be assigned randomly to 
different treatment groups. One of the groups must 
act as a control group, demonstrating what happens 
when the active treatment is not applied. 
Participants in the control group receive a placebo 
treatment that looks exactly like the active 
treatments but cannot influence the response 
variable. To preserve the integrity of the placebo, 
both researchers and subjects may be blinded. When 
a study is designed properly, the only difference 
between treatment groups is the one imposed by the 
researcher. Therefore, when groups respond 
differently to different treatments, the difference 
must be due to the influence of the explanatory 
variable. 


“An ethics problem arises when you are considering 
an action that benefits you or some cause you 
support, hurts or reduces benefits to others, and 
violates some rule.” [footnote] Ethical violations in 
statistics are not always easy to spot. Professional 
associations and federal agencies post guidelines for 
proper conduct. It is important that you learn basic
statistical procedures so that you can recognize
proper data analysis.


Glossary 


Explanatory Variable 
the independent variable in an experiment; 
the value controlled by researchers 


Treatments 
different values or components of the 
explanatory variable applied in an experiment 


Response Variable 
the dependent variable in an experiment; 
the value that is measured for change at the 
end of an experiment 


Experimental Unit 
any individual or object to be measured 


Lurking Variable 
a variable that has an effect on a study even 
though it is neither an explanatory variable 
nor a response variable 


Random Assignment 
the act of organizing experimental units into 
treatment groups using random methods 


Control Group
a group in a randomized experiment that
receives an inactive treatment but is
otherwise managed exactly as the other
groups


Informed Consent 
Any human subject in a research study must 
be cognizant of any risks or costs associated 
with the study. The subject has the right to 
know the nature of the treatments included in 
the study, their potential risks, and their 
potential benefits. Consent must be given 
freely by an informed, fit participant. 


Institutional Review Board 
a committee tasked with oversight of research 
programs that involve human subjects 


Placebo 
an inactive treatment that has no real effect
on the response variable


Blinding 
not telling participants which treatment a 
subject is receiving 


Double-blinding 
the act of blinding both the subjects of an 
experiment and the researchers who work 
with the subjects 


Introduction -- Descriptive Statistics -- MRU - C 
Lemieux (2017) 

When you have large amounts
of data, you will need to organize it in a way that 
makes sense. These ballots from an election are 
rolled together with similar ballots to keep them 
organized. (credit: William Greeson) 


Chapter objective 
By the end of this chapter, the student should be 
able to: 


* Display data graphically and interpret graphs:
pie charts, bar graphs, histograms and box
plots.

* Recognize, describe, calculate, and interpret
measures of location: quartiles and
percentiles.

* Recognize, describe, calculate, and interpret
measures of centre: mean, median and mode.

* Recognize, describe, calculate, and interpret
measures of variation: variance, standard
deviation, range, interquartile range and
coefficient of variation.


Once you have collected data, what will you do with 
it? Data can be described and presented in many 
different formats. For example, suppose you are 
interested in buying a house in a particular area. 
You may have no clue about the house prices, so 
you might ask your real estate agent to give you a 
sample data set of prices. Looking at all the prices in 
the sample often is overwhelming. A better way 
might be to look at the median price and the 
variation of prices. The median and variation are 
just two ways that you will learn to describe data. 
Your agent might also provide you with a graph of 
the data. 


In this chapter, you will study visual and numerical 
ways to describe and display your data. This area of 
statistics is called Descriptive Statistics. If you 
have collected 200 data values, just looking at them 
won't tell anyone much about the data. Instead, you 
want to summarize the raw data in a way that you 
can better understand what’s going on. 


Categorical data is usually summarized using a
visual representation like a pie chart or a bar graph. 
The numerical summary for categorical data would 
be a percentage, fraction or decimal. 


For quantitative data, it is a bit more involved. In 
general, there are three components to a good 
summary of quantitative data: a visual 
representation, a measure of centre, and a measure 
of variation. 


The visual representation can give you a sense of
the centre and variation in the data, and it is especially
useful for determining the shape of the data. Is the
data all clustered together? Are there a bunch of 
data on one side, but a few on the other? Do all of 
the data values occur with the same frequency? The 
shape describes this. Histograms and box plots are 
both visual representations of quantitative data. 


Measures of centre, also known as averages or 
measures of central tendency, provide a value(s) 
that gives us a sense of a typical value in the data 
set. This doesn’t tell us about a specific member of 
the population, but instead lets us know what the 
average one is like. Measures of centre we will learn 
about include the mean, median, and mode. 


Though a measure of centre tells us about a typical
value in a data set, measures of variation tell us
how much the data values vary from each other. Are
they all clumped together? Are they all spread out?
Measures of variation can tell us how consistent or 
how volatile the data is. If we are analyzing stock 
prices, the more variation there is then the more 
volatile and risky the investment is. But the rewards 
may be greater! Measures of variation that we will 
learn about include range, variance, standard 
deviation, interquartile range, and the coefficient of 
variation. 


When we describe the shape, centre, and variation 
of the data, we are describing the distribution of 
the data. If we only focus on one aspect of the 
distribution (say the centre), then we miss out on 
some important information, which is why we 
always want to consider all three aspects when 
summarizing quantitative data. For example, 
suppose two stock prices have the same average 
price. If we only look at the average, we might think 
they are equivalent. But if one of them has greater 
variation, then that means that one is more volatile 
and riskier than the other one. 
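The stock comparison can be illustrated with made-up numbers. The two price lists below are hypothetical and were chosen only so that both stocks have the same mean; the sketch uses the sample standard deviation from Python's statistics module as the measure of variation.

```python
import statistics

# Hypothetical closing prices (in dollars) for two stocks over five days.
stock_a = [48, 49, 50, 51, 52]
stock_b = [30, 40, 50, 60, 70]

for name, prices in [("A", stock_a), ("B", stock_b)]:
    mean = statistics.mean(prices)
    sd = statistics.stdev(prices)  # sample standard deviation
    print(f"stock {name}: mean = {mean}, standard deviation = {sd:.2f}")
```

Both means come out to 50, but stock B has a much larger standard deviation, so looking at the centre alone would hide the difference in volatility.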


Box plots (or box and whisker diagrams) are a 
special type of visual representation that includes 
both visual and numerical elements. A box plot 
divides the data into quarters (or quartiles). Thus, a 
box plot contains a measure of centre (the second 
quartile is the halfway point, called the median) and 
a measure of variation (the distance between the
first quartile and the third quartile is called the
interquartile range). The box plot can also give a
sense of the data’s shape. The box plot then is the
only representation that we will see that gives us a
sense of the distribution all in one representation
(i.e. gives a sense of centre, variation, and
shape). It also has an additional benefit of
identifying outliers. Outliers are data values that 
are abnormal. That is, they differ significantly from 
the other data values. A box plot shows if there are 
any outliers. 
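The numbers behind a box plot can be sketched as follows. The data values are hypothetical, and two choices here are assumptions rather than anything stated above: quartiles are computed with the "inclusive" method of Python's statistics.quantiles (textbooks and software use several slightly different quartile conventions), and outliers are flagged with the common rule of thumb of 1.5 times the interquartile range beyond the quartiles.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]  # hypothetical values; 40 looks unusual

# Quartiles split the sorted data into four parts.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range

# Assumed rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("min, Q1, median, Q3, max:", min(data), q1, median, q3, max(data))
print("IQR:", iqr)
print("outliers:", outliers)
```

A different quartile convention would shift Q1 and Q3 slightly, but the value 40 would still stand far apart from the rest of the data.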


This chapter will go over descriptive statistics by 
focusing on visual and numerical representations of 
data. Though categorical data is discussed, the main 
focus will be on determining the distribution and 
outliers for quantitative data. 


The vast majority of the time when conducting 
statistical studies, we will only have access to 
sample data. In this situation, we will want to 
analyze the sample data to see if we can come to 
any conclusions about the population data. Once we 
make the leap from simply describing a sample to 
using that sample to draw conclusions about the 
population, we are doing inferential statistics. 
These concepts and techniques are covered in 
chapters seven and eight.


Key Idea 


The distribution of sample data ideally mimics the 
distribution of the population. But the smaller the 
sample size the greater the potential for there to be 
differences between the two distributions. This 
means that, for a large enough sample size, the 
distribution of the sample generally gives a good 
idea of the distribution of the population. This is an
example of the law of large numbers. In other 
words, if the sample size is large enough and the 
data is collected properly, then the sample mean 
will most likely be a good estimate of the 
population mean, the sample standard deviation 
will most likely be a good estimate of the 
population standard deviation, and the shape of the 
sample data will most likely be a good estimate of 
the shape of the population. 
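A small simulation can make this key idea concrete. The population below is hypothetical (100,000 values spread evenly between 0 and 100), and the sample sizes are arbitrary; the sketch draws larger and larger simple random samples and compares each sample mean to the population mean.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 values spread evenly between 0 and 100.
population = [rng.uniform(0, 100) for _ in range(100_000)]
pop_mean = statistics.fmean(population)

for n in (10, 100, 1_000, 10_000):
    sample = rng.sample(population, n)  # simple random sample of size n
    sample_mean = statistics.fmean(sample)
    print(f"n = {n:>6}: sample mean = {sample_mean:6.2f}, "
          f"off by {abs(sample_mean - pop_mean):5.2f}")
```

The gap between the sample mean and the population mean tends to shrink as n grows, although any single small sample can happen to land close by chance.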


Descriptive Statistics - Visual Representations of 
Data - MRU - C Lemieux (2017) 


Visual representations of categorical data 


Below are tables comparing the number of part-time 
and full-time students at De Anza College and 
Foothill College enrolled for the spring 2010 
quarter. The tables display counts (frequencies) and 
percentages or proportions (relative frequencies). 
The percent columns make comparing the same 
categories in the colleges easier. Displaying 
percentages along with the numbers is often helpful, 
but it is particularly important when comparing sets 
of data that do not have the same totals, such as the 
total enrollments for both colleges in this example. 
Notice how much larger the percentage for part- 
time students at Foothill College is compared to De 
Anza College. 


De Anza College
            Number    Percent
Full-time    9,200      40.9%
Part-time   13,296      59.1%
Total       22,496       100%


Foothill College
            Number    Percent
Full-time    4,059      28.6%
Part-time   10,124      71.4%
Total       14,183       100%


Fall Term 2007 (Census day)
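The percent columns can be recomputed from the counts. This is a minimal sketch using the enrollment counts reported for the two colleges; the dictionary layout is just one convenient way to hold them.

```python
# Enrollment counts reported for the two colleges.
enrollment = {
    "De Anza College": {"Full-time": 9200, "Part-time": 13296},
    "Foothill College": {"Full-time": 4059, "Part-time": 10124},
}

for college, counts in enrollment.items():
    total = sum(counts.values())
    print(f"{college} (total {total:,})")
    for status, count in counts.items():
        # relative frequency = count in the category / total count
        print(f"  {status}: {count:,} ({count / total:.1%})")
```

Because the two colleges have different totals, the percentages are what make the part-time groups directly comparable.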


Tables are a good way of organizing and displaying 
data. But graphs can be even more helpful in 
understanding the data. There are no strict rules 
concerning which graphs to use. Two graphs that 
are used to display categorical data are pie charts 
and bar graphs. 


In a pie chart, categories of data are represented by 
wedges in a circle and are proportional in size to the 
percent of individuals in each category. 


In a bar graph, the length of the bar for each 
category is proportional to the number or percent of 
individuals in each category. Bars may be vertical or 
horizontal. 


Look at [link] and [link] and determine which 
graph (pie or bar) you think displays the 
comparisons better. 


It is a good idea to look at a variety of graphs to see
which is the most helpful in displaying the data. We
might make different choices of what we think is the
“best” graph depending on the data and the context.
Our choice also depends on what we are using the
data for.


[Figure: pie charts of part-time and full-time
enrollment at De Anza College and at Foothill
College]


[Figure: bar graph of student status (full time, part
time) at De Anza and Foothill]


Visual Representations of Quantitative 
Data 


Bar Graphs 


Bar graphs can also be used to summarize discrete 
quantitative data and categorical data. Bar graphs 
consist of bars that are separated from each other. 
The bars can be rectangles or they can be 
rectangular boxes (used in three-dimensional plots), 
and they can be vertical or horizontal. The bar 
graph shown in [link] has age groups represented 
on the x-axis and proportions on the y-axis. 


By the end of 2011, Facebook had over 146 
million users in the United States. [link] shows 
three age groups, the number of users in each 
age group, and the proportion (%) of users in 
each age group. Construct a bar graph using 
this data. 


Age groups    Number of Facebook users    Proportion (%) of Facebook users
13-25         65,082,280                  45%
26-44         53,300,200                  36%
45-64         27,885,100                  19%


[Figure: bar graph of the proportion (%) of Facebook
users in each age group, with Ages (13-25, 26-44,
45-64) on the x-axis and a y-axis from 0 to 50]

Try it 


Park City is broken down into six voting
districts. The table shows the percent of the 
total registered voter population that lives in 
each district as well as the percent total of the 
entire population that lives in each district. 
Construct a bar graph that shows the 
registered voter population by district. 


District    Registered voter population    Overall city population
1           15.5%                          19.4%
2           12.2%                          15.6%
3            9.8%                           9.0%
4           17.4%                          18.5%
5           22.8%                          20.7%
6           22.3%                          16.8%


[Figure: bar graph of Voter Proportion (%) by
District, with a y-axis from 0.0% to 25.0%]

Frequency tables 


Twenty students were asked how many hours they 
worked per day. Their responses, in hours, are as 
follows: 5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3.


[link] lists the different data values in ascending 
order and their frequencies. 


DATA VALUE    FREQUENCY
2             3
3             5
4             3
5             6
6             2
7             1


Frequency Table of Student Work Hours


A frequency is the number of times a value of the 
data occurs. According to [link], there are three 
students who work two hours, five students who 
work three hours, and so on. The sum of the values 
in the frequency column, 20, represents the total 
number of students included in the sample. 
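Counting frequencies by hand is easy to get wrong, so a short sketch may help. It tallies the twenty responses with Python's collections.Counter; the list is the work-hours data from this section, entered in the order given.

```python
from collections import Counter

# Hours worked per day by the twenty students, in the order given.
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

frequency = Counter(hours)  # maps each data value to how often it occurs

print("DATA VALUE  FREQUENCY")
for value in sorted(frequency):
    print(f"{value:^10}  {frequency[value]:^9}")
print("total:", sum(frequency.values()))
```

The total of the frequency column should always equal the number of observations, which is a quick check that no response was skipped.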


A relative frequency is the ratio (fraction or 
proportion) of the number of times a value of the 
data occurs in the set of all outcomes to the total 
number of outcomes. To find the relative 
frequencies, divide each frequency by the total 
number of students in the sample—in this case, 20. 
Relative frequencies can be written as fractions, 
percents, or decimals. 


DATA VALUE    FREQUENCY    RELATIVE FREQUENCY
2             3            3/20 or 0.15
3             5            5/20 or 0.25
4             3            3/20 or 0.15
5             6            6/20 or 0.30
6             2            2/20 or 0.10
7             1            1/20 or 0.05


Frequency Table of Student Work Hours with 
Relative Frequencies 


The sum of the values in the relative frequency 
column of [link] is 20/20, or 1.


Cumulative relative frequency is the 
accumulation of the previous relative frequencies. 
To find the cumulative relative frequencies, add all 
the previous relative frequencies to the relative 
frequency for the current row, as shown in [link]. 


DATA VALUE    FREQUENCY    RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY
2             3            3/20 or 0.15          0.15
3             5            5/20 or 0.25          0.15 + 0.25 = 0.40
4             3            3/20 or 0.15          0.40 + 0.15 = 0.55
5             6            6/20 or 0.30          0.55 + 0.30 = 0.85
6             2            2/20 or 0.10          0.85 + 0.10 = 0.95
7             1            1/20 or 0.05          0.95 + 0.05 = 1.00


Frequency Table of Student Work Hours with 
Relative and Cumulative Relative Frequencies 


The last entry of the cumulative relative frequency 
column is one, indicating that one hundred percent 
of the data has been accumulated. 
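
The three columns described above can also be computed with a few lines of code. The following is a minimal sketch, assuming Python and the 20 student responses from this example; the function name frequency_table is an illustrative choice, not part of the text.

```python
from collections import Counter
from fractions import Fraction

# Hours worked per day by the 20 students in the example.
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]


def frequency_table(data):
    """Return rows of (value, frequency, relative frequency,
    cumulative relative frequency), in ascending order of value."""
    counts = Counter(data)
    n = len(data)
    rows = []
    cumulative = Fraction(0)
    for value in sorted(counts):
        relative = Fraction(counts[value], n)  # frequency / total
        cumulative += relative                 # running sum of relative frequencies
        rows.append((value, counts[value], relative, cumulative))
    return rows


for value, freq, rel, cum in frequency_table(hours):
    print(f"{value:>5} {freq:>9} {float(rel):>9.2f} {float(cum):>10.2f}")
```

Exact fractions are used so that the last cumulative entry is exactly one; with rounded decimals it may only be close to one, as the note below explains.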


NOTE 
Because of rounding, the relative frequency column 


may not always sum to one, and the last entry in 
the cumulative relative frequency column may not 
be one. However, they each should be close to one. 


[link] represents the heights, in inches, of a sample 
of 100 male semiprofessional soccer players. 


HEIGHTS (INCHES)    FREQUENCY      RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY
60-61.99            5              5/100 = 0.05          0.05
62-63.99            3              3/100 = 0.03          0.05 + 0.03 = 0.08
64-65.99            15             15/100 = 0.15         0.08 + 0.15 = 0.23
66-67.99            40             40/100 = 0.40         0.23 + 0.40 = 0.63
68-69.99            17             17/100 = 0.17         0.63 + 0.17 = 0.80
70-71.99            12             12/100 = 0.12         0.80 + 0.12 = 0.92
72-73.99            7              7/100 = 0.07          0.92 + 0.07 = 0.99
74-75.99            1              1/100 = 0.01          0.99 + 0.01 = 1.00
                    Total = 100    Total = 1.00


Frequency Table of Soccer Player Height 


The data in this table have been grouped into the 
following intervals: 


* 60 to 61.99 inches
* 62 to 63.99 inches
* 64 to 65.99 inches
* 66 to 67.99 inches
* 68 to 69.99 inches
* 70 to 71.99 inches
* 72 to 73.99 inches
* 74 to 75.99 inches


In this sample, there are five players whose heights 
fall within the interval 59.95-61.95 inches, three 
players whose heights fall within the interval 61.95- 
63.95 inches, 15 players whose heights fall within 
the interval 63.95-65.95 inches, 40 players whose 
heights fall within the interval 65.95-67.95 inches, 
17 players whose heights fall within the interval 
67.95-69.95 inches, 12 players whose heights fall 
within the interval 69.95-71.95, seven players 
whose heights fall within the interval 71.95-73.95, 
and one player whose height falls within the
interval 73.95-75.95. All heights fall between the
endpoints of an interval and not at the endpoints. 


From [link], find the percentage of heights 
that are less than 65.95 inches. 


If you look at the first, second, and third rows, 
the heights are all less than 65.95 inches. 


There are 5 + 3 + 15 = 23 players whose 
heights are less than 65.95 inches. The 
percentage of heights less than 65.95 inches is 
then 23/100 or 23%. This percentage is the
cumulative relative frequency entry in the 


third row. 
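
The same lookup can be sketched in code. This is a minimal illustration, assuming Python and the class boundaries and frequencies of the soccer height table; percent_between is a hypothetical helper name, and it only counts classes that lie entirely inside the requested range.

```python
# (lower boundary, upper boundary, frequency) for each class in the height table.
classes = [
    (59.95, 61.95, 5),
    (61.95, 63.95, 3),
    (63.95, 65.95, 15),
    (65.95, 67.95, 40),
    (67.95, 69.95, 17),
    (69.95, 71.95, 12),
    (71.95, 73.95, 7),
    (73.95, 75.95, 1),
]


def percent_between(classes, low, high):
    """Percentage of data values in classes that lie entirely within [low, high]."""
    total = sum(freq for _, _, freq in classes)
    inside = sum(freq for lo, hi, freq in classes if lo >= low and hi <= high)
    return 100 * inside / total


# Heights less than 65.95 inches: the first three classes.
print(percent_between(classes, 59.95, 65.95))
```

Because the endpoints passed in are class boundaries, no class is ever split, which matches how the table is read by hand.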


Try It 


[link] shows the amount, in inches, of annual 
rainfall in a sample of towns. 


Rainfall (Inches)    Frequency     Relative Frequency    Cumulative Relative Frequency
3-4.99               6             6/50 = 0.12           0.12
5-6.99               7             7/50 = 0.14           0.12 + 0.14 = 0.26
7-9.99               15            15/50 = 0.30          0.26 + 0.30 = 0.56
10-11.99             8             8/50 = 0.16           0.56 + 0.16 = 0.72
12-12.99             9             9/50 = 0.18           0.72 + 0.18 = 0.90
13-14.99             5             5/50 = 0.10           0.90 + 0.10 = 1.00
                     Total = 50    Total = 1.00


From [link], find the percentage of rainfall 
that is less than 9.99 inches. 


Try It Solutions 


0.56 or 56% 


From [link], find the percentage of heights 
that fall between 61.95 and 65.95 inches. 


Add the relative frequencies in the second and 
third rows: 0.03 + 0.15 = 0.18 or 18%. 


Try It 


From [link], find the percentage of rainfall 
that is between 7.00 and 12.99 inches. 


Try It Solutions 


0.30 + 0.16 + 0.18 = 0.64 or 64% 


Use the heights of the 100 male 
semiprofessional soccer players in [link]. Fill 
in the blanks and check your answers. 


1. The percentage of heights that are from 
67.95 to 71.95 inches is: __. 

2. The percentage of heights that are from 
67.95 to 73.95 inches is: __. 

3. The percentage of heights that are more 
than 65.95 inches is: __. 

4. The number of players in the sample who 
are between 61.95 and 71.95 inches tall 
is: __.

5. What kind of data are the heights? 

6. Describe how you could gather this data 
(the heights) so that the data are 
characteristic of all male semiprofessional 
soccer players. 


Remember, you count frequencies. To find 
the relative frequency, divide the frequency by 
the total number of data values. To find the 
cumulative relative frequency, add all of the 


previous relative frequencies to the relative 
frequency for the current row. 


1. 29%
2. 36%
3. 77%
4. 87
5. quantitative continuous
6. get rosters from each team and choose a
simple random sample from each

Nineteen people were asked how many miles, to 
the nearest mile, they commute to work each day. 
The data are as follows: 2; 5; 7; 3; 2; 10; 18; 15; 20; 7; 10;
18; 5; 12; 13; 12; 4; 5; 10. [link] was produced:
Frequency of Commuting Distances


1. Is the table correct? If it is not correct, 
what is wrong? 

2. True or False: Three percent of the people 
surveyed commute three miles. If the 
statement is not correct, what should it 
be? If the table is incorrect, make the 
corrections. 

3. What fraction of the people surveyed 
commute five or seven miles? 

4. What fraction of the people surveyed 
commute 12 miles or more? Less than 12 
miles? Between five and 13 miles (not 
including five and 13 miles)? 


1. No. The frequency column sums to 18, 
not 19. Not all cumulative relative 
frequencies are correct. 

2. False. The frequency for three miles 
should be one; for two miles (left out), 
two. The cumulative relative frequency 


column should read: 0.1052, 0.1579, 
0.2105, 0.3684, 0.4737, 0.6316, 0.7368, 
0.7895, 0.8421, 0.9474, 1.0000. 

3. 5/19

4. 7/19, 12/19, 7/19


Try It 


[link] represents the amount, in inches, of 
annual rainfall in a sample of towns. What 
fraction of towns surveyed get at least 12 
inches of rainfall each year? 


Try It Solutions 


14/50


Histograms 


In the introduction, the idea of distribution was 
introduced. The distribution refers to the shape, 
centre and variation of quantitative data. To 
determine the shape of the data, we need to look at 
a visual representation of the data. The best visual 
representation to look at is the histogram. 


Bar graphs and histograms look very similar. They
both have bars whose heights represent the
frequency of the data. But bar graphs are used for
categorical data and discrete quantitative data (i.e.
whole number data). Histograms are used for
continuous quantitative data (i.e. numbers with
decimals) and sometimes discrete quantitative data
as well. Since there is a gap between categories and
whole numbers, the bars in bar graphs do not
touch. But for continuous data, there is no gap
between the numbers, so the bars for histograms do
touch.


For most of the work you do in this book, you will 
use a histogram to display the data. One advantage 
of a histogram is that it can readily display large 
data sets. The following explains how to make a 
histogram by hand, but you can use statistical 
software to do this quite quickly. 


A histogram consists of contiguous (adjoining) 
boxes. It has both a horizontal axis and a vertical 
axis. The horizontal axis is labeled with what the 
data represents (for instance, distance from your 
home to school). The vertical axis is labeled either 
frequency or relative frequency (or percent 
frequency or probability). The graph will have the 
same shape with either label. The histogram (like 
the stemplot) can give you the shape of the data, the 


center, and the spread of the data. 


The relative frequency is equal to the frequency for 
an observed value of the data divided by the total 
number of data values in the sample. (Remember,
frequency is defined as the number of times an 
answer occurs.) If: 


* f = frequency 

* n = total number of data values (or the sum of 
the individual frequencies), and 

* RF = relative frequency,


then:
$RF = \frac{f}{n}$


For example, if three students in Mr. Ahab's English 
class of 40 students received from 90% to 100%, 
then f = 3, n = 40, and $RF = \frac{f}{n} = \frac{3}{40} = 0.075$.
7.5% of the students received 90-100%. 90-100% 
are quantitative measures. 


To construct a histogram, first decide how many 
bars or intervals, also called classes, represent the 
data. Many histograms consist of five to 15 bars or 
classes for clarity. The number of bars needs to be 
chosen. Choose a starting point for the first interval 
to be less than the smallest data value. A 
convenient starting point is a lower value carried 
out to one more decimal place than the value with 
the most decimal places. For example, if the value 
with the most decimal places is 6.1 and this is the 


smallest value, a convenient starting point is 6.05 
(6.1 - 0.05 = 6.05). We say that 6.05 has more
precision. If the value with the most decimal places 
is 2.23 and the lowest value is 1.5, a convenient 
starting point is 1.495 (1.5 - 0.005 = 1.495). If the 
value with the most decimal places is 3.234 and the 
lowest value is 1.0, a convenient starting point is 
0.9995 (1.0 - 0.0005 = 0.9995). If all the data 
happen to be integers and the smallest value is two, 
then a convenient starting point is 1.5 (2-0.5 = 
1.5). Also, when the starting point and other 
boundaries are carried to one additional decimal 
place, no data value will fall on a boundary. The 
next two examples go into detail about how to 
construct a histogram using continuous data and 
how to create a histogram using discrete data. 
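
The rule for the convenient starting point can be sketched as a small function. This is a minimal sketch, assuming Python and that the data values are supplied as strings so that their decimal places are preserved; convenient_start is a hypothetical name used only for illustration.

```python
from decimal import Decimal


def convenient_start(values):
    """Return the smallest data value minus half a unit in the next decimal place.

    `values` is an iterable of strings such as "6.1", so that trailing
    decimal places are not lost.
    """
    decimals = [Decimal(v) for v in values]
    # Largest number of decimal places found among the data values.
    places = max(max(0, -d.as_tuple().exponent) for d in decimals)
    # One of the convenient numbers 0.5, 0.05, 0.005, ...
    offset = Decimal(5).scaleb(-(places + 1))
    return min(decimals) - offset


print(convenient_start(["6.1", "7.3"]))    # smallest value 6.1, one decimal place
print(convenient_start(["2.23", "1.5"]))   # smallest value 1.5, two decimal places
print(convenient_start(["2", "5", "9"]))   # integer data
```

Decimal arithmetic is used here instead of floating point so that a result such as 1.495 is represented exactly.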


The following data are the heights (in inches to the 
nearest half inch) of 100 male semiprofessional 
soccer players. The heights are continuous data, 
since height is measured. 

60; 60.5; 61; 61; 61.5

63.5; 63.5; 63.5

64; 64; 64; 64; 64; 64; 64; 64.5; 64.5; 64.5; 64.5;
64.5; 64.5; 64.5; 64.5

66; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66.5; 66.5;
66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5;
67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67.5;
67.5; 67.5; 67.5; 67.5; 67.5; 67.5

68; 68; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69.5;
69.5; 69.5; 69.5; 69.5

70; 70; 70; 70; 70; 70; 70.5; 70.5; 70.5; 71; 71; 71

72; 72; 72; 72.5; 72.5; 73; 73.5

74

The smallest data value is 60. Since the data with 
the most decimal places has one decimal (for 
instance, 61.5), we want our starting point to have 
two decimal places. Since the numbers 0.5, 0.05, 
0.005, etc. are convenient numbers, use 0.05 and 
subtract it from 60, the smallest value, for the 
convenient starting point. 

60 - 0.05 = 59.95 which is more precise than, say,
61.5 by one decimal place. The starting point is, 
then, 59.95. 

The largest value is 74, so 74 + 0.05 = 74.05 is 
the ending value. 

Next, calculate the width of each bar or class 
interval. To calculate this width, subtract the 
starting point from the ending value and divide by 
the number of bars (you must choose the number 
of bars you desire). Suppose you choose eight bars. 
$\frac{74.05 - 59.95}{8} = 1.76$


NOTE

We will round up to two and make each bar or
class interval two units wide. Rounding up to two
is one way to prevent a value from falling on a
boundary. Rounding to the next number is often
necessary even if it goes against the standard rules
of rounding. For this example, using 1.76 as the
width would also work. A guideline that is
followed by some for the number of bars or class
intervals is to take the square root of the number of
data values and then round to the nearest whole
number, if necessary. For example, if there are
150 values of data, take the square root of 150
and round to 12 bars or intervals.


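The boundaries are:

* 59.95
* 59.95 + 2 = 61.95
* 61.95 + 2 = 63.95
* 63.95 + 2 = 65.95
* 65.95 + 2 = 67.95
* 67.95 + 2 = 69.95
* 69.95 + 2 = 71.95
* 71.95 + 2 = 73.95
* 73.95 + 2 = 75.95

The heights 60 through 61.5 inches are in the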
interval 59.95-61.95. The heights that are 63.5 are 
in the interval 61.95-63.95. The heights that are 
64 through 64.5 are in the interval 63.95-65.95. 
The heights 66 through 67.5 are in the interval 
65.95-67.95. The heights 68 through 69.5 are in 
the interval 67.95-69.95. The heights 70 through 


71 are in the interval 69.95-71.95. The heights 72 
through 73.5 are in the interval 71.95-73.95. The 
height 74 is in the interval 73.95-75.95. 
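
The width calculation, the boundaries, and the assignment of heights to intervals can be sketched as follows. This is a minimal sketch, assuming Python, the starting point 59.95, a width of two, and eight bars as chosen in this example; class_boundaries and class_index are hypothetical helper names.

```python
from decimal import Decimal

# Width suggested by the calculation, before rounding up to two.
raw_width = (Decimal("74.05") - Decimal("59.95")) / 8


def class_boundaries(start, width, bars):
    """Return the bars + 1 boundaries, starting at `start` and stepping by `width`."""
    start, width = Decimal(start), Decimal(width)
    return [start + i * width for i in range(bars + 1)]


def class_index(value, boundaries):
    """Return the 0-based index of the interval that contains `value`."""
    value = Decimal(value)
    for i in range(len(boundaries) - 1):
        # Strict inequalities: no data value should fall on a boundary.
        if boundaries[i] < value < boundaries[i + 1]:
            return i
    raise ValueError(f"{value} is not inside any interval")


boundaries = class_boundaries("59.95", "2", 8)
print(raw_width)
print([str(b) for b in boundaries])
print(class_index("63.5", boundaries))  # second interval, 61.95-63.95
```

Carrying the boundaries to one more decimal place than the data is what makes the strict inequalities safe: a height recorded to the nearest half inch can never equal a boundary such as 61.95.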

The following histogram displays the heights on 
the x-axis and relative frequency on the y-axis. 
[Figure: histogram of the soccer player heights; x-axis: Heights; y-axis: Relative frequency.]


Titles, labelling and numbering of visual 
representations 

Visual representations should be numbered. As
they are images, they would be numbered as 
figures. For example, a histogram would be 
numbered “Figure 3”. This means it is the third 
image in the document. This makes it easier to 
refer back to: “In Figure 3, we can see that ...” 
The title of the visual representation includes the 
name of the visual representation and the context: 


“Histogram of ...”. 
The label that goes along the axis includes the 
variable and the unit: Variable (unit). 
These three aspects combined will make it easy to 
refer to the image and let the reader of the image 
know what the image is about. 

A frequency table would be similarly titled and
labelled, but since it is a table and not an image, it 
would be referred to as “Table 4” (meaning the 
fourth table in the document). 

As you look through this textbook, notice how all
of the images and tables are numbered as described 
above. 


Try It 


The following data are the shoe sizes of 50 
male students. The sizes are continuous data 
since shoe size is measured. Construct a 
histogram and calculate the width of each bar 
or class interval. Suppose you choose six bars. 


9; 9; 9.5; 9.5; 10; 10; 10; 10; 10; 10; 10.5; 
10.5; 10.5; 10.5; 10.5; 10.5; 10.5; 10.5

Smallest value: 9

Largest value: 14 

Convenient starting value: 9 — 0.05 = 8.95 
Convenient ending value: 14 + 0.05 = 14.05 


$\frac{14.05 - 8.95}{6} = 0.85$


The calculation suggests using 0.85 as the
width of each bar or class interval. You can 
also use an interval with a width equal to one. 


Shape 


The shape of the data helps us understand what 
kind of pattern the data has. For example, if all of 
the data values have the same frequency, then the 
shape will be distinct (it is called uniform). If the 
data has a skew in it, then that helps us understand 
the measure of centre better (to be discussed in the 
next section). Overall, the shape helps us see how 
the data is behaving. Data that has similar shapes 
will behave in similar ways. 


The shape of the data set is determined by looking 
at a visual representation of the data and usually the 
histogram. Common ways of describing the shape 


include whether it is symmetrical or not, how many 
distinct peaks it has (unimodal, bimodal, 
multimodal), and whether the data has a tail only 
on one side (skew). 


Data is symmetric if the shape is the same on both
sides of centre. 

Skewed data has a "tail" on one side. This 
means that there are some data values that are 
far from the centre but only on one side. This
is a type of non-symmetric data. 

For a histogram, the term "modal" refers to the 
number of distinct peaks. You almost want to 
think about mountain peaks. If there are 
multiple, distinct mountain peaks, then we say 
the data is multi-modal. If there is only one 
distinct peak, then the data is uni-modal. Not 
all data has a distinct peak. 

Uniform data occurs if the frequency of each 
interval is about the same. This will result in a 
flat looking histogram. 

A very important shape in statistics is the bell- 
curve (the shape in the first row, second 
column). This shape is symmetric, uni-modal 
and looks like a bell. If data has this shape (and 
satisfies a few other properties that will be 
discussed in Chapter 5), we call this data 
normal. 


Here are some examples of different shapes of data:

[Figure: Various shapes that data can have. Panel titles include: Bimodal skewed left; Bimodal skewed right; Skewed left with outliers; Symmetric with outliers; Skewed right with outliers.]


The above is provided to give you some ideas on 
how to describe the shape of data. But not all data 
sets have a nice shape that fits into one of the 
above. Sometimes the data can only be described as 
non-symmetric. 


How NOT to Lie with Statistics 


It is important to remember that the very reason we 
develop a variety of methods to present data is to 
develop insights into the subject of what the 
observations represent. We want to get a "sense" of 
the data. Are the observations all very much alike, or
are they spread across a wide range of values? Are
they bunched at one end of the spectrum, or are they
distributed evenly? We are trying to get a
visual picture of the numerical data. Shortly we will


develop formal mathematical measures of the data, 
but our visual graphical presentation can say much. 
It can, unfortunately, also say much that is 
distracting, confusing and simply wrong in terms of 
the impression the visual leaves. Many years ago 
Darrell Huff wrote the book How to Lie with 
Statistics. It has been through 25 plus printings and 
sold more than one and one-half million copies. His 
perspective was a harsh one and used many actual 
examples that were designed to mislead. He wanted 
to make people aware of such deception, but 
perhaps more importantly to educate so that others 
do not make the same errors inadvertently. 


Again, the goal is to enlighten with visuals that tell 
the story of the data. Pie charts have a number of 
common problems when used to convey the 
message of the data. Too many pieces of the pie 
overwhelm the reader. No more than perhaps five or six
categories should be shown if the chart is to give an
idea of the relative importance of each piece. This is,
after all, the goal of a pie chart: to show which subset
matters most relative to the
others. If there are more components than this, then
perhaps an alternative approach would be better or 
perhaps some can be consolidated into an "other" 
category. Pie charts cannot show changes over time, 
although we see this attempted all too often. In 
federal, state, and city finance documents pie charts 
are often presented to show the components of 
revenue available to the governing body for 
appropriation: income tax, sales tax, motor vehicle


taxes and so on. In and of itself this is interesting 
information and can be nicely done with a pie chart. 
The error occurs when two years are set side-by- 
side. Because the total revenues change year to year, 
but the size of the pie is fixed, no real information is 
provided and the relative size of each piece of the 
pie cannot be meaningfully compared. 


Histograms can be very helpful in understanding the 
data. Properly presented, they can be a quick visual 
way to present probabilities of different categories 
by the simple visual of comparing relative areas in 
each category. Here the error, purposeful or not, is 
to vary the width of the categories. This of course 
makes comparison to the other categories 
impossible. It does embellish the importance of the 
category with the expanded width because it has a 
greater area, inappropriately, and thus visually 
"says" that that category has a higher probability of 
occurrence. 


Changing the units of measurement of the axis can 
smooth out a drop or accentuate one. If you want to 
show large changes, then measure the variable in 
small units, penny rather than thousands of dollars. 
And of course to continue the fraud, be sure that the 
axis does not begin at zero, zero. If it begins at zero, 
zero, then it becomes apparent that the axis has 
been manipulated. 


Again, the goal of descriptive statistics is to convey 


meaningful visuals that tell the story of the data. 
Purposeful manipulation is fraud and unethical at 
the worst, but even at its best, making these types of
errors will lead to confusion in the
analysis.


Chapter Review


A bar graph is a chart that uses either horizontal or 
vertical bars to show comparisons among categories. 
One axis of the chart shows the specific categories 
being compared, and the other axis represents a 
discrete value. Some bar graphs present bars 
clustered in groups of more than one (grouped bar 
graphs), and others show the bars divided into 
subparts to show cumulative effect (stacked bar 
graphs). Bar graphs are especially useful when 
categorical data is being used, but they can also be 
used for quantitative discrete data. 


A histogram is a graphic version of a frequency 
distribution. The graph consists of bars of equal 
width drawn adjacent to each other. The horizontal 
scale represents classes of quantitative data values 


and the vertical scale represents frequencies. The 
heights of the bars correspond to frequency values. 
Histograms are typically used for large, continuous, 
quantitative data sets. 


The students in Ms. Ramirez’s math class have 
birthdays in each of the four seasons. [link] 
shows the four seasons, the number of students 
who have birthdays in each season, and the 
percentage (%) of students in each group. 
Construct a bar graph showing the percentage 
of students in each group. 


Seasons    Number of students    Proportion of population


David County has six high schools. Each school
sent students to participate in a county-wide 
science competition. [link] shows the 
percentage breakdown of competitors from 
each school, and the percentage of the entire 
student population of the county that goes to 
each school. Construct a bar graph that shows 
the county-wide population percentage of 
students at each school. 


High School    Science competition population    Overall student population


Construct a histogram for the following:


1. Pulse Rates for Women             Frequency
   120-129                           1

2. Actual Speed in a 30 MPH Zone     Frequency
   58-61

3. Tar (mg) in Nonfiltered           Frequency

Homework 

Use the following information to answer the next two 
exercises: Suppose one hundred eleven people who 
shopped in a special t-shirt store were asked the 
number of t-shirts they own costing more than $19 
each. 
[Figure: relative frequency histogram; x-axis: Number of T-shirts costing more than $19 each (1 to 7); y-axis: Relative frequency (10/111 to 40/111).]


The percentage of people who own at most 
three t-shirts costing more than $19 each is 
approximately: 


1. 21
2. 59
3. 41
4. Cannot be determined


If the data were collected by asking the first 
111 people who entered the store, then the type 
of sampling is: 


1. cluster
2. simple random
3. stratified
4. convenience


Glossary 


Frequency 
the number of times a value of the data 
occurs 


Histogram 
a graphical representation in x-y form of the 
distribution of data in a data set; x represents 
the data and y represents the frequency, or 
relative frequency. The graph consists of 
contiguous rectangles. 


Relative Frequency 
the ratio of the number of times a value of the
data occurs in the set of all outcomes to the
number of all outcomes


Descriptive Statistics - Numerical Summaries of Data 
- MRU - C Lemieux 


By the end of this section, we want to be able to 
describe the distribution of quantitative data (i.e. 
shape, centre and variation). In the previous section, 
we looked at the shape of quantitative data. This 
section focuses on numerical summaries of data for 
quantitative data. In particular, it focuses on 
measures of centre and measures of variation. 


There are other numerical summaries of data called 
measures of location, which will be discussed in the 
next section. 

To find the mean of this data, we need to find the
number that balances the data equally on both
sides. Notice that the mean here is not a data value.


Measures of centre 


Measures of centre or average give us a sense of 
what a typical value in a data set is. For example, 
the average number of children in a family in 
Canada is 1.9. This means that a typical family will 
have about 1.9 children. Obviously, no family has 
exactly 1.9 children, but this gives a sense of how 
many children families have on average. Further, 
some families may have 8 children. Others may 


have no children. The measure of centre gives a 
sense of what is going on in the middle of the data 
set. 


Even though you may wish to round an average to 
a whole number (especially when it is about the 
number of people), this is not necessary nor is it 


appropriate as it is giving a sense of the centre of
the data, which is not necessarily an actual data
value.


The "center" of a data set is a way of describing a 
typical value in a data set. The three most widely 
used measures of the "center" of the data are the 
mean, median and mode. 


To explain these three measures of centre, let’s look 
at an example. Suppose we want to find the average 
weight of 50 people. To calculate the mean weight 
of the 50 people, we would add the 50 weights 
together and divide by 50. To find the median 
weight of the 50 people, order the data from least 
heavy to most heavy, and find the weight that splits 
the data into two equal parts. The mode is the most 
commonly occurring value. To find the mode, find 
the weight that occurs the most frequently. 


This section provides more details on how to find 
the measures of centre, the notation for the 
measures, and when it is best to use which
measure. 


NOTE 

Though the words “mean” and “average” are 
sometimes used interchangeably, they do not 
necessarily mean the same thing. In general, 
“average” is any measure of centre and “mean” is a 


specific type of centre. Many people use average 
and mean as the same, but not always. For 
example, when people talk about average housing 
price, they are usually referring to the median 
house price. 


Mean 


The mean of a data set can be thought of as a 
balancing point (or fulcrum). If you think of 
numbers as weighted, then the mean is the number 
that will balance the data values evenly. Suppose 
your data values are 1, 2, 3, 4, 5. Then the number 
that balances the data is 3. To go a little deeper, the 
balance point is three because the distance between 
3 and the data values less than it is equal to the 
distance between 3 and the data values greater than 


it as shown in [link]. 


Let's try a harder example. Suppose our data values 
are 0, 1, 1, 2, 3, 3, 4, 6. The mean will be the 
number such that the total distance to the data 
values below it and the total distance to the data 
values above it are the same. Let's see if 3 is the mean
again. Then the distance between our suggested
"mean" and 0 is 3; the distance between our "mean"
and 1 is 2 (but there are two of them); and the
distance between our "mean" and 2 is 1. That is, the
total distance between our "mean" and all of the data
values below it is 3 + 2 + 2 + 1 = 8. If 3 is actually
our mean, then the total distance between 3 and the 
data values above it will also be 8. Let's check. The 
distance between our "mean" and 4 is 1; the distance 
between our "mean" and 6 is 3. The total distance 
above 3 is only 4. Therefore, 3 cannot be our mean 
as it doesn't balance our data. 


The two data values of 3 were ignored as their 
distance from the suggested mean is 0. Therefore, 


they would not change the answer if included. 


From our calculations above, the choice of 3 was
too big, as the lower side was too heavy. Let's try 2.5 as
our mean. If the mean is 2.5, then the distance 
between our "mean" and 0 is 2.5; the distance 
between our "mean" and 1 is 1.5 (but there are two 
of them); the distance between our "mean" and 2 is 
0.5. Thus the total distance between our mean of 
2.5 and the data values below it is 2.5 + 1.5 + 1.5
+ 0.5 = 6. If 2.5 is our mean, then the total 
distance above 2.5 should also be 6. The distance 
between our "mean" and 3 is 0.5 (but there are two 
of them); the distance between our "mean" and 4 is 
1.5; the distance between our "mean" and 6 is 3.5. 
Thus the total distance between the data values and 
our suggested mean of 2.5 is 0.5 + 0.5 + 1.5 + 3.5
= 6! Therefore, 2.5 is the mean for this data. 
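
The guess-and-check argument above can be sketched in code. This is a minimal sketch, assuming Python; the helper name balance is illustrative. It returns the total distance to the data values below a candidate and the total distance to the data values above it, and the candidate is the mean when the two totals agree.

```python
def balance(data, candidate):
    """Return (total distance below, total distance above) the candidate value."""
    below = sum(candidate - x for x in data if x < candidate)
    above = sum(x - candidate for x in data if x > candidate)
    return below, above


data = [0, 1, 1, 2, 3, 3, 4, 6]

print(balance(data, 3))        # the two totals differ, so 3 does not balance the data
print(balance(data, 2.5))      # the two totals agree, so 2.5 balances the data
print(sum(data) / len(data))   # adding the values and dividing by the count gives the same number
```

Values equal to the candidate are skipped because their distance is zero, just as the two data values of 3 were ignored in the first attempt.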


Thankfully we don't have to do these in-depth 
calculations and guesses each time. Instead the 
formula is pretty straight-forward. 


The Greek letter μ (pronounced "mew") represents
the population mean. That is, it is the mean for the
population data.

Formula for Population Mean
$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$


The letter used to represent the sample mean is an
x with a bar over it (pronounced “x bar”): $\bar{x}$. It is
the mean of a sample of data from the population. 


The sample mean is an estimate of the population
mean. One of the requirements for the sample
mean to be a good estimate of the population
mean is for the sample taken to be truly random.

Formula for Sample Mean
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$


To see how the formula works, consider the sample:
1; 1; 1; 2; 2; 3; 4; 4; 4; 4; 4
$\bar{x} = \frac{1 + 1 + 1 + 2 + 2 + 3 + 4 + 4 + 4 + 4 + 4}{11} \approx 2.7$


Note: Since it is sample data, we use the symbol $\bar{x}$.


Application of the law of large numbers


If the size of a random sample is increased, then 
the sample mean will more likely be a better 
estimate of the population mean. 

Note: Just because the sample size increases does 
not mean that the sample mean for the larger 
sample must be a better estimate. It is only that it 
is more likely to be a better estimate. 
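
A small simulation can illustrate this idea. The following is a sketch under stated assumptions: Python, a fair six-sided die as a hypothetical population (its population mean is 3.5), and an arbitrary fixed seed so that the run can be repeated. None of these choices come from the text.

```python
import random


def sample_mean(sample):
    """Add the data values and divide by the number of values."""
    return sum(sample) / len(sample)


POPULATION_MEAN = 3.5      # mean of a fair six-sided die (assumed population)
rng = random.Random(2017)  # fixed seed, chosen arbitrarily

for n in (10, 100, 1_000, 10_000, 100_000):
    rolls = [rng.randint(1, 6) for _ in range(n)]
    estimate = sample_mean(rolls)
    print(f"n = {n:>7}  sample mean = {estimate:.4f}  "
          f"error = {abs(estimate - POPULATION_MEAN):.4f}")
```

The error tends to shrink as n grows, but it does not have to shrink at every step: a particular larger sample can still give a worse estimate than a smaller one, which is exactly the caution in the note above.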


Median 


On a road, the median is in the middle of the road. 
In statistics, the median is the middle data value 
(when the data is in order). 


You can quickly find the location or position of the
median by using the expression $\frac{n + 1}{2}$.


The letter n is the total number of data values in the 
sample. If n is an odd number, the median is the 
middle value of the ordered data (ordered smallest 
to largest). If n is an even number, the median is 
equal to the two middle values added together and 
divided by two after the data has been ordered. For 
example, if the total number of data values is 97,
then $\frac{n + 1}{2} = \frac{97 + 1}{2} = 49$. The median is the
49th value in the ordered data. If the total number of
data values is 100, then $\frac{n + 1}{2} = \frac{100 + 1}{2} = 50.5$.
The median occurs midway between the 50th
and 51st values. The location of the median and the 


value of the median are not the same. The upper 
case letter M is often used to represent the median. 
The next example illustrates the location of the 
median and the value of the median. 
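
The difference between the location of the median and its value can be sketched in code. This is a minimal sketch, assuming Python; median_location and median are illustrative names, and the two small data sets are the ones used earlier in the discussion of the mean.

```python
def median_location(n):
    """Position of the median in the ordered data, counting from 1."""
    return (n + 1) / 2


def median(data):
    """Value of the median: the middle value, or the average of the two middle values."""
    ordered = sorted(data)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                        # odd n: single middle value
    return (ordered[middle - 1] + ordered[middle]) / 2  # even n: average the two middle values


print(median_location(97))    # a whole number: the median is a data value
print(median_location(100))   # ends in .5: the median lies between two data values
print(median([1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4]))
print(median([0, 1, 1, 2, 3, 3, 4, 6]))
```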


Mode 


Another measure of the center is the mode. The 
mode is the data value that occurs most frequently 
and at least twice. 


A data set can have


* no mode,

* one mode (unimodal),

* two modes (bimodal), or

* many modes (multimodal).


Consider the statistics exam scores for 20 students:
50; 53; 59; 59; 63; 63; 72; 72; 72; 72; 72; 76; 78;
81; 83; 84; 84; 84; 90; 93
The most frequent score is 72, which occurs five
times. Mode = 72.


Note 
The mode can be calculated for qualitative data as
well as for quantitative data. For example, if the
data set is: red, red, red, green, green, yellow,
purple, black, blue, the mode is red.
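The definition used here (most frequent value, occurring at least twice) can be sketched as follows. The helper name `modes` is an illustrative assumption, and the exam scores are typed in as read from the list above.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest < 2:
        # No value occurs at least twice, so there is no mode
        return []
    return [value for value, count in counts.items() if count == highest]


scores = [50, 53, 59, 59, 63, 63, 72, 72, 72, 72, 72,
          76, 78, 81, 83, 84, 84, 84, 90, 93]
colours = ["red", "red", "red", "green", "green",
           "yellow", "purple", "black", "blue"]

print(modes(scores))     # [72]
print(modes(colours))    # ['red']
print(modes([1, 2, 3]))  # []
```

The same function works for qualitative and quantitative data because it only counts how often each value occurs.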


AIDS data indicating the number of months a 
patient with AIDS lives after taking a new 
antibody drug are as follows (smallest to 
largest): 


3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17;
17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29;
29; 31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47

Calculate the mean, median and mode.


The calculation for the mean is:


\[ \bar{x} = \frac{3 + 4 + (8)(2) + 10 + 11 + 12 + 13 + 14 + (15)(2) + (16)(2) + \dots + 35 + 37 + 40 + (44)(2) + 47}{40} = 23.6 \]

To find the median, M, first use the formula
for the location. The location is:
\[ \frac{n + 1}{2} = \frac{40 + 1}{2} = 20.5 \]

Starting at the smallest value, the median is
located between the 20th and 21st values (the
two 24s):

3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17;
17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29;
29; 31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47


\[ M = \frac{24 + 24}{2} = 24 \]

To find the mode, we first have to determine if any
data values repeat. If no data values repeat, there is
no mode. Since 8 repeats, we know there is a
mode. 8 occurs twice. We need to check if any
data value occurs more than twice. If a data
value occurs more often than every other value, then
it is the mode. Since no data value occurs more than
twice, every data value that occurs twice is a
mode.

Therefore, the modes are 8, 15, 16, 17, 22, 24,
26, 27, 29, 33, 34, 44. This data set is
multimodal.


Suppose that in a small town of 50 people, one
person earns $5,000,000 per year and the
other 49 each earn $30,000. Which is the
better measure of the "center": the mean, the
median or the mode?


\[ \bar{x} = \frac{5{,}000{,}000 + 49(30{,}000)}{50} = 129{,}400 \]
M = 30,000


(There are 49 people who earn $30,000 and 
one person who earns $5,000,000.) 


The mode is 30,000 as this data value occurs 
49 times. 


Since the median and mode are equal, let's
focus on the median. The median is a better
measure of the "center" than the mean because
49 of the values are 30,000 and one is
5,000,000. The 5,000,000 is an outlier. The
30,000 gives us a better sense of the middle of
the data.
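The three measures for this town can be checked with Python's standard `statistics` module. This is only a sketch of the arithmetic, and the variable name `incomes` is an illustrative choice.

```python
from statistics import mean, median, mode

# One person earning $5,000,000 and 49 people earning $30,000
incomes = [5_000_000] + [30_000] * 49

print(f"mean   = {mean(incomes):,.0f}")    # mean   = 129,400
print(f"median = {median(incomes):,.0f}")  # median = 30,000
print(f"mode   = {mode(incomes):,.0f}")    # mode   = 30,000
```

A single value out of 50 moves the mean to more than four times the income that 49 of the 50 people actually earn, while the median and mode do not move at all.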


The above example highlights two important ideas: 


* Outliers: We have defined outliers as data 
values that are significantly different from 
other data values, but we have not provided a 
way of finding them. This will be discussed in 
the next section. Regardless, we can see that 5 
million is significantly different than 30 
thousand in the above example. 

* Skew: When a data set has outliers, the
outliers have the potential to skew the mean.
In the above example, the centre of the data is
30,000, but the mean is 129,400. Thus the
outlier of 5 million is pulling the mean up.
That is, it is skewing the centre value by
pulling it to the right on the number line.


Comparing measures of centre 


Above we have described how to find each of the 
measures of centre. But how do you choose which 
measure of centre to use in which situation? One 
option is to provide all three measures of centre, but 
sometimes this can be overwhelming to the
audience. Instead, you want to pick the one that
best describes the data. The following are some
general guidelines for choosing the best measure of 
centre. 


The mean is often the best measure of centre to use 
because it is the most well-known and familiar of 
the measures of centre. It is also the only measure of 
centre that is computed using all of the sample 
values. But the mean is susceptible to outliers. As 
was seen in [link], if there is an outlier, the mean 
can be pulled in one direction away from the centre. 


An outlier is any data value that is significantly
different from the other data values. In [link], the
outlier is 5 million as it is significantly higher than
the other data values. We will discuss how to find
outliers in section 2.3 (Boxplots).


If there is an outlier in the data set that is skewing 
the mean, the best measure of centre to use is the 
median as it is not susceptible to outliers. 


But be careful: The presence of outliers does not
necessarily mean that the median is the best
measure of centre. Here are a couple of examples
where this is the case:


1. Suppose there are 200 data values in a sample
and one data value is an outlier. Then the mean
will most likely not be affected much by the outlier.

2. Suppose there is a data set that has outliers, but
one is a high outlier and one is a low outlier.
Then the outliers may balance out and not
affect the mean.


The mode is best used for categorical data, but can 
sometimes be used for quantitative data. For 
example, in [link], the mode would be a good 
measure of centre because the majority of data 
values are the same. 


In [link], since there are no outliers, the mean is the 
best measure of centre to use. In [link], since there 
is an outlier (5 million) and the mean and median 
are quite different, the median is the best measure 
of centre to use. 


The following tables compare the measures of
centre.


How to mislead with averages


Consider the following situation: As you arrive at an 
open house in your preferred new home location, a 
neighbour comes up to you while he is walking his 
dog. “This is a great neighbourhood to live in! The 
average income in this neighbourhood is $60,000,” 
he tells you. You are pleased to hear how affluent
the community is. A year after you’ve moved into
your new home, the same neighbour comes to your 
door and asks you to sign a petition. “The city is 
overvaluing the homes in our neighbourhood again, 
which means more taxes. The average income in 
this neighbourhood is $20,000. We can’t afford 
these increases.” You dutifully sign the petition 
because you don’t want to pay more taxes, but 
you’re also confused. Wasn’t the average income a 
lot higher last year? What happened? Is your 
neighbour a liar? In this example, there are many 
different possible scenarios that could explain the 
discrepancy. But no matter what the scenario is, the 
neighbour is picking his statistics to fit his situation. 


One scenario: The neighbour may be picking and 
choosing which measure of centre to use. Suppose 
that most people in the neighbourhood make 
around $20,000 a year, but there are a few people 
who live on the street with the super nice view who 
make $300,000 a year. Then in the first case, when 
he says the average income is $60,000, he has used 
the mean which has been pulled higher by the 
outliers of $300,000. He chose to use the mean to 
make the neighbourhood look more affluent than it 
really is. 


But when he wanted to make the argument that the
neighbourhood wasn’t as affluent and should be in a
lower tax bracket, he changed which measure of
centre to use. Instead, he may have used the
median or mode because they aren’t influenced by
the outliers.
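To make this scenario concrete, the sketch below uses a hypothetical neighbourhood. The household counts (18 and 3) are assumptions chosen only so that the mean works out to $60,000; they are not data from this text.

```python
from statistics import mean, median

# Hypothetical neighbourhood: 18 households at $20,000 and
# 3 households at $300,000 (counts are assumed for illustration)
incomes = [20_000] * 18 + [300_000] * 3

mean_income = mean(incomes)
median_income = median(incomes)

print(f"mean   = ${mean_income:,.0f}")    # mean   = $60,000
print(f"median = ${median_income:,.0f}")  # median = $20,000
```

Both figures are legitimately "the average" of the same data, which is why it matters to ask which measure of centre is being quoted.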


Another scenario: The neighbour may be choosing
how he defines income to help make his point. In
the first case, he may have only used those who are
employed to come up with the average salary, while
in the second case, he may have used all adults in
the neighbourhood including students living with
their parents, stay-at-home parents, retired people
or people out of work. Their incomes may be very
low or non-existent, which would pull the average
lower. In this scenario, he may be using the
same measure of centre, but is picking what he
means by income to get the results he wants.


There are other possible scenarios. Can you think of 
any? 


Skew 


As has been noted above, if there are outliers in a 
data set, this can cause the mean to be pulled up or 
down (i.e. be either higher than expected or lower 
than expected) by these outliers. Outliers don't have 
to be present for this to happen. Essentially, any 
time that there are data values that cause the mean 
and median to be significantly different, then we say 
the data is skewed. 


* If the mean is significantly larger than the
median and the histogram has a long tail on the
right, then the data is right skewed or
positively skewed.

* If the mean is significantly smaller than the
median and the histogram has a long tail on the
left, then the data is left skewed or negatively
skewed.

* If the mean and the median are approximately
the same and the histogram has balanced tails,
then the data is symmetric.
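The mean-versus-median part of these rules can be sketched in Python. This text does not define how large a difference counts as "significantly" different, so the `tolerance` parameter below (measured in standard deviations) is an arbitrary assumption, as is the function name `describe_skew`. The sketch does not look at the histogram, which should still be checked.

```python
from statistics import mean, median, pstdev


def describe_skew(values, tolerance=0.1):
    """Label the data by comparing the mean with the median.

    tolerance is an assumed cut-off, in standard deviations,
    for calling the two measures "significantly" different.
    """
    spread = pstdev(values)
    if spread == 0:
        return "symmetric"  # all values equal, so mean == median
    gap = (mean(values) - median(values)) / spread
    if gap > tolerance:
        return "right skewed"
    if gap < -tolerance:
        return "left skewed"
    return "symmetric"


print(describe_skew([1, 2, 3, 4, 5]))        # symmetric
print(describe_skew([1, 1, 1, 2, 10]))       # right skewed
print(describe_skew([-10, -2, -1, -1, -1]))  # left skewed
```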


Examples of skewness and symmetry 

These are "perfect" examples of skewness and 
symmetry. In reality, there may be multiple modes 
or the mean and median will be similar but not
equal. These are provided to give an example.


Measures of variation


An important characteristic of any set of data is the 
variation in the data. In some data sets, the data 
values are concentrated closely near the mean; in 
other data sets, the data values are more widely 
spread out from the mean. There are five measures
of variation: range, standard deviation, variance,
interquartile range and coefficient of variation.


The range is the easiest to calculate. It is found by
subtracting the minimum value in the data set from
the maximum value in the data set. Though the
range is easy to calculate, it is very much affected
by outliers.


The interquartile range will be discussed in the 
section on box plots (section 2.3). 


The most common measure of variation, or spread, 
is the standard deviation. The standard deviation 
measures how far data values are from their mean, 
on average. 


Variation within a sample vs. variation between
samples
When talking about variation or variability in
statistics, there are two different kinds: variation
within a sample and variation between samples.
When we discuss finding the standard deviation,
range or any measure of variation of a sample, we
are discussing variation within a sample. In this
case, we are looking at how the data values vary
from each other. Most of the time, when we talk
about variation this is what we are talking about.
We can also talk about how much different samples
vary from each other. For example, we could take
multiple samples and find the sample mean of each
sample. If we talk about how much the means vary
from each other, we are discussing variation
between samples. We will discuss this specific type
of variation in Chapter 6.
The law of large numbers says that, for random
samples, as the sample size increases, the
sample will more closely resemble the population.
For example, as the sample size increases, the
sample standard deviation will approach the
population standard deviation. Thus, the variation
within the sample will more closely mimic the variation
within the population as the sample size increases. Also,
as the sample size increases, the sample means will
approach the population mean. Thus, there will be
less variation between the sample means. This
means that the variation between samples decreases
as the sample size increases. When we discuss
sampling variability, we are discussing variation
between samples.
For this chapter, we are focusing on variation
within a sample.


The standard deviation (and variance) 


* provides a numerical measure of the overall 
amount of variation in a data set, and 

* can be used to determine whether a particular 
data value is close to or far from the mean. 


The standard deviation provides a measure of the overall variation in a 
data set 


The standard deviation is small when the data are 
all concentrated close to the mean, exhibiting little 
variation or spread. The standard deviation is larger 
when the data values are more spread out from the 
mean, exhibiting more variation. 


Suppose that we are studying the amount of time 
customers wait in line at the checkout at 
supermarket A and supermarket B. It is known that 
the average wait time at both supermarkets is about 
five minutes. At supermarket A, though, the 
standard deviation for the wait time is two minutes; 
at supermarket B the standard deviation for the wait 
time is four minutes. 


Because supermarket B has a higher standard 
deviation, we know that there is more variation in 
the wait times at supermarket B. Overall, wait times 
at supermarket B are more spread out from the 
average; wait times at supermarket A are more 
concentrated near the average. This means that at 
supermarket B, you have a greater chance of having 
a short wait time, but also a greater chance of 
having a long wait time, compared to supermarket 
A. That means the wait times are more volatile at
supermarket B. On the other hand, your wait will be
about the same from one visit to the next at
supermarket A. That means the wait times are more
consistent at supermarket A.


One way we could summarize the supermarket
situation is as follows:


* A typical wait time at supermarket A is 5
minutes give or take 2 minutes. This means
that someone typically has to wait 3 to 7
minutes in the checkout line.

* A typical wait time at supermarket B is 5
minutes give or take 4 minutes. This means
that someone typically has to wait 1 to 9
minutes in the checkout line.


Here the term “typical” means common, normal. So
normally people will wait between 3 and 7 minutes at
supermarket A, but there will be some people who
only wait 2 minutes and some who wait 10 minutes
at the checkout. That is, the typical range only
provides a sense of what is going on in the middle of
the data, but there are values occurring outside of
that range.


For the typical value, you can use any measure of
centre. But for the give or take value, you have to
use standard deviation. No other measure of
variation works.


Calculating the Standard Deviation 


The following explains how to calculate the
standard deviation by hand. We will be using
computer software to do this. Thus it is not
important to know this section in detail, but it is
helpful to know the basics of how the standard
deviation is calculated to help understand what the
standard deviation is.


If x is a number, then the difference "x - mean" is
called its deviation. In a data set, there are as many
deviations as there are items in the data set. The
deviations are used to calculate the standard
deviation. If the numbers belong to a population, in
symbols a deviation is \(x - \mu\). For sample data, in
symbols a deviation is \(x - \bar{x}\).


The procedure to calculate the standard deviation 
depends on whether the numbers are the entire 
population or are data from a sample. The 
calculations are similar, but not identical. Therefore 
the symbol used to represent the standard deviation 
depends on whether it is calculated from a 
population or a sample. The lower case letter s
represents the sample standard deviation and the
Greek letter \(\sigma\) (sigma, lower case) represents the
population standard deviation. If the sample has the
same characteristics as the population, then s should
be a good estimate of \(\sigma\).


To calculate the standard deviation, we need to
calculate the variance first. The variance is the
average of the squares of the deviations (the
\(x - \bar{x}\) values for a sample, or the \(x - \mu\) values for a
population). The symbol \(\sigma^2\) represents the
population variance; the population standard
deviation \(\sigma\) is the square root of the population
variance. The symbol \(s^2\) represents the sample
variance; the sample standard deviation s is the
square root of the sample variance. You can think of
the standard deviation as a special average of the
deviations.


If the numbers come from a census of the entire 
population and not a sample, when we calculate 
the average of the squared deviations to find the 
variance, we divide by N, the number of items in the 
population. If the data are from a sample rather
than a population, when we calculate the average of
the squared deviations, we divide by n - 1, one less
than the number of items in the sample.


Formulas for the Sample Standard Deviation


* \( s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \)

* For the sample standard deviation, the
denominator is n - 1, that is the sample size - 1.


Formulas for the Population Standard Deviation


* \( \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \)

* For the population standard deviation, the
denominator is N, the number of items in the
population.


Since the standard deviation is found by taking a
square root, the standard deviation is always
positive or zero.
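The two formulas differ only in the denominator, which a short Python sketch can make visible. The function names and the eight data values below are made up for illustration; they do not come from this text.

```python
import math


def sample_std(values):
    """Sample standard deviation s: squared deviations divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values for a sample")
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


def population_std(values):
    """Population standard deviation sigma: squared deviations divided by N."""
    N = len(values)
    if N == 0:
        raise ValueError("need at least one value")
    mu = sum(values) / N
    return math.sqrt(sum((x - mu) ** 2 for x in values) / N)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5

print(population_std(data))        # 2.0
print(round(sample_std(data), 3))  # 2.138
```

Dividing by n - 1 instead of N always gives a slightly larger result for the same numbers, and the gap shrinks as the number of values grows.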


Since the variance is the square of the standard 
deviation, it is not helpful as a descriptive statistic. 
For example, if you are looking at the weights of 
basketballs in kg, then the standard deviation will 
be in kg, while the variance will be in kg^2. Thus
the variance is meaningless when trying to interpret 
the variation in data. It is helpful later on in 
statistics, but at this point it is not. 


In a fifth grade class, the teacher was interested in 
the average age and the sample standard deviation 
of the ages of her students. The following data are 
the ages for a SAMPLE of n = 20 fifth grade 
students. The ages are rounded to the nearest half 
year: 

9; 9.5; 9.5; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5;
11; 11; 11; 11; 11; 11; 11.5; 11.5; 11.5


\[ \bar{x} = \frac{9 + 9.5(2) + 10(4) + 10.5(4) + 11(6) + 11.5(3)}{20} = 10.525 \]
The average age is 10.53 years, rounded to two
places.

The variance may be calculated by using a table.
Then the standard deviation is calculated by taking
the square root of the variance. We will explain the
parts of the table after calculating s.

Data   Freq.   Deviation                Deviation squared        Freq. x deviation squared
9      1       9 - 10.525 = -1.525      (-1.525)^2 = 2.325625    1 x 2.325625 = 2.325625
9.5    2       9.5 - 10.525 = -1.025    (-1.025)^2 = 1.050625    2 x 1.050625 = 2.101250
10     4       10 - 10.525 = -0.525     (-0.525)^2 = 0.275625    4 x 0.275625 = 1.1025
10.5   4       10.5 - 10.525 = -0.025   (-0.025)^2 = 0.000625    4 x 0.000625 = 0.0025
11     6       11 - 10.525 = 0.475      (0.475)^2 = 0.225625     6 x 0.225625 = 1.35375
11.5   3       11.5 - 10.525 = 0.975    (0.975)^2 = 0.950625     3 x 0.950625 = 2.851875
                                                                 Total = 9.7375


The sample variance, \(s^2\), is equal to the sum of the
last column (9.7375) divided by the total number
of data values minus one (20 - 1):

\[ s^2 = \frac{9.7375}{20 - 1} = 0.5125 \]

The sample standard deviation s is equal to the
square root of the sample variance:

\( s = \sqrt{0.5125} = 0.715891 \), which is rounded to two
decimal places, s = 0.72.
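The table calculation can be reproduced from the ages and their frequencies. The sketch below follows the same steps (mean, squared deviations weighted by frequency, divide by n - 1, square root); the variable names are illustrative choices.

```python
import math

# (age, frequency) pairs for the sample of 20 fifth grade students
age_counts = [(9, 1), (9.5, 2), (10, 4), (10.5, 4), (11, 6), (11.5, 3)]

n = sum(f for _, f in age_counts)                 # sample size
x_bar = sum(x * f for x, f in age_counts) / n     # sample mean
# Sum of (frequency) x (squared deviation), the last column of the table
sum_sq = sum(f * (x - x_bar) ** 2 for x, f in age_counts)
s2 = sum_sq / (n - 1)                             # sample variance
s = math.sqrt(s2)                                 # sample standard deviation

print(n, round(x_bar, 3), round(sum_sq, 4), round(s2, 4), round(s, 2))
# 20 10.525 9.7375 0.5125 0.72
```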


Explanation of the standard deviation calculation shown in the table 


The deviations show how spread out the data are
about the mean. The data value 11.5 is farther from
the mean than is the data value 11, which is
indicated by the deviations 0.975 and 0.475. A
positive deviation occurs when the data value is
greater than the mean, whereas a negative deviation
occurs when the data value is less than the mean.
The deviation is -1.525 for the data value nine. If
you add the deviations, the sum is always zero.
(For [link], there are n = 20 deviations.) So you
cannot simply add the deviations to get the spread
of the data. By squaring the deviations, you make
them positive numbers, and the sum will also be
positive. The variance, then, is the average squared
deviation.


The variance is a squared measure and does not 
have the same units as the data. Taking the square 
root solves the problem. The standard deviation 
measures the spread in the same units as the data. 


Notice that instead of dividing by n = 20, the
calculation divided by n - 1 = 20 - 1 = 19 because
the data is a sample. For the sample variance, we
divide by the sample size minus one (n - 1). Why
not divide by n? The answer has to do with the
population variance. The sample variance is an
estimate of the population variance. Based on the
theoretical mathematics that lies behind these
calculations, dividing by (n - 1) gives a better
estimate of the population variance.


The standard deviation, s or \(\sigma\), is either zero or
larger than zero. When the standard deviation is
zero, there is no spread; that is, all the data
values are equal to each other. The standard
deviation is small when the data are all
concentrated close to the mean, and is larger when
the data values show more variation from the mean.
When the standard deviation is a lot larger than
zero, the data values are very spread out about the
mean; outliers can make s or \(\sigma\) very large.


Coefficient of variation 


The standard deviation is a very good measure of
variation, but when comparing two data sets it is
not always the best, in particular if the means of
the two data sets are different. Suppose you are
comparing the yearly salaries (excluding bonuses) of
junior employees versus CEOs at oil and gas
companies around Alberta. The yearly salaries for
the junior employees will be significantly smaller
than those of the CEOs. Let’s say the average salary
for junior employees is $45,000 while for CEOs it is
$500,000. Now suppose that the standard deviation
for both groups is $50,000. If we only looked at the
standard deviation, we might say that the variation
in both groups is the same. But really, variation of
$50,000 when the average salary is $45,000 is quite
a bit more than when the average salary is $500,000.
That is, there is more relative variation in the junior
employees’ salaries. The standard deviation doesn’t
capture this difference, but the coefficient of
variation, a measure of relative variation, does.
That is, it takes into account that a data set with
bigger data values might have a larger standard
deviation without having larger relative variation.


The coefficient of variation is found by expressing 
the standard deviation as a percentage of the mean: 


\[ \text{Coefficient of Variation} = \frac{s}{\bar{x}} (100\%) \]


In the above example, the coefficient of variation
would be:

CofV for junior employees = \( \frac{50{,}000}{45{,}000} (100\%) = 111.1\% \)

CofV for CEOs = \( \frac{50{,}000}{500{,}000} (100\%) = 10\% \)


The larger the coefficient of variation, the larger the
relative variation. Thus, as a measure of relative
variation, the junior employees have significantly
more relative variation (111.1%) compared to the
CEOs (10%).


Here are some points about the coefficient of 
variation: 


* The coefficient of variation is not affected by
multiplicative changes of scale.

* The coefficient of variation is used to
compare variation between data sets. This is
very important to remember. For multiple data
sets, if the means are the same, you can
compare the standard deviations. BUT if the
means are different, you MUST use the
coefficient of variation to compare the
variation in the data sets.

* If the standard deviation is larger than the
mean, the coefficient of variation is bigger than
100%.
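These points can be checked with a short sketch that uses the salary figures stated in the example above (a standard deviation of $50,000 for both groups, with means of $45,000 and $500,000). The function name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a mean of zero")
    return std_dev / mean * 100


junior = coefficient_of_variation(50_000, 45_000)
ceo = coefficient_of_variation(50_000, 500_000)
print(f"{junior:.1f}%  {ceo:.1f}%")  # 111.1%  10.0%

# Expressing the same salaries in thousands of dollars rescales both
# the standard deviation and the mean, so the result is unchanged.
junior_in_thousands = coefficient_of_variation(50, 45)
print(abs(junior - junior_in_thousands) < 1e-9)  # True
```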


Measure: When to use


Range: The range is rarely the best measure of
variation to use. But it is a good quick calculation
of ...

Standard deviation: Similar to the mean, this is the
most common measure of variation. Also, it is
derived from the mean. Therefore, if your best
measure of centre is the mean, then the standard
deviation is a good complement to it. Further, it is
best used when finding the ...

Variance: As it is the square of the standard
deviation, it is NEVER the best measure of variation
to use. It is helpful in later topics in ...

Interquartile range: This is a not very well known
measure of variation, but it is helpful in describing
the range for the middle 50% of the data values.
Further, it is based on measures of location.
Therefore, if your best measure of centre is the
median, then the IQR is a good complementary
measure ...

Coefficient of variation: This is not well known,
but it is useful for giving a context-free
interpretation of variation. It is the best measure to
use when comparing the variations of two or more
data sets that have different measures of centre.


When to use which measure of variation 


Suppose you are looking at two companies and
each company has 24 employees. At one company,
everybody except the CEO makes $30,000. The
CEO makes $490,000. Thus, the data values would
be

$30,000; $30,000; $30,000; $30,000; $30,000; ... ;
$490,000


The second company has an interesting policy.
Everybody who starts at the company makes
$30,000 a year, but every time someone new is
hired, each existing employee gets paid $20,000
more. They only hire
one person at a time. So, the first person who was
hired started at $30,000, then when a second
person got hired, the first person’s salary was
raised to $50,000. When a third person got hired,
the first person’s salary was raised to $70,000
while the salary of the second person hired was
raised to $50,000. This has been done 23 times.
Therefore, their data values (i.e. salaries) would
look like this:

$30,000; $50,000; $70,000; $90,000; $110,000; ... ;
$490,000

Without doing any calculations, we can see that
company one has fairly consistent salaries except
for the CEO, while company two has salaries that
are more spread out.

The following table provides the count (i.e. sample
size), mean, and the measures of variation for the
two companies.

                           Company one   Company two
Coefficient of variation   190.98%       54.39%


In the table above, notice that the range is the 
same for the two data sets. If we only looked at the 
range, this would give a false sense that the 
amount of variation in the two data sets is the 
same, but we know it isn’t. 

The standard deviation is measuring how much, on
average, the data values vary from the mean. For
company one, 23 of the 24 data values deviate the
same amount from the mean
($49,166.67 - $30,000 = $19,166.67) with only the
$490,000 deviating a large amount from the mean.

For company two, two data values deviate by
only $10,000 ($250,000 and $270,000) while two
data values deviate by a whopping $230,000
($30,000 and $490,000).

In company one, 23 out of 24 data values deviate 
by less than $20,000. But for company two, only 2 
out of 24 deviate by less than $20,000. This 
suggests that company one will have a smaller 
standard deviation than company two because 
there is less average deviation. This is supported by 
MegaStat, which shows that the population 
standard deviation for company one is $91,920.10 
versus company two, which has a population 
standard deviation of $138,443.73. 

Notice that even though company one has an
outlier (the CEO’s salary), its standard deviation is
less than that of company two. That is, the average
variation from the mean is less for company one.
Thus, the presence of an outlier does not necessarily
result in a larger standard deviation.

The story is different when we look at the
coefficient of variation. For company one, it is
190.98%, while for company two, it is 54.39%.
This means that company one has larger relative 
variation than company two. This is because 
company two has a higher mean than company one 
and thus the variation, relative to the mean, isn’t as 
large as it is in company one. 

In this situation, the best measure of variation to 
use would be the coefficient of variation as we are 
comparing two data sets with two different means. 
Based on this, company one has larger relative 
variation than company two. 
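The figures quoted in this example can be recomputed from the two salary lists. The sketch below is only a check of the arithmetic using Python's `statistics` module rather than MegaStat, and the helper name `summarize` is an illustrative choice. When the numbers are run, the quoted standard deviations match the population formula, while the quoted coefficients of variation appear to correspond to the sample standard deviation divided by the mean, so both are reported.

```python
from statistics import mean, pstdev, stdev

company_one = [30_000] * 23 + [490_000]
company_two = [30_000 + 20_000 * k for k in range(24)]  # 30,000 up to 490,000


def summarize(salaries):
    """Count, mean, range, both standard deviations, and CV (sample-based)."""
    m = mean(salaries)
    return {
        "count": len(salaries),
        "mean": m,
        "range": max(salaries) - min(salaries),
        "population_sd": pstdev(salaries),
        "sample_sd": stdev(salaries),
        "cv_percent": stdev(salaries) / m * 100,
    }


for name, salaries in (("one", company_one), ("two", company_two)):
    row = summarize(salaries)
    print(name, row["count"], f"{row['mean']:,.2f}", f"{row['range']:,}",
          f"{row['population_sd']:,.2f}", f"{row['cv_percent']:.2f}%")
```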

Notice that variance is not discussed here. As 
stated above, the variance is the square of the 
standard deviation. Therefore, the units for 
variance in this example would be $^2 (dollars squared), which
makes no sense. Again, variance is not a useful 
descriptive statistic. 


Common Mistake 


Variation and variance might seem like the same
word but they aren’t. Variation is a general term
used to discuss how much the data values vary
from each other, how much spread there is in the
data, how consistent the data is, how volatile or
risky the data is, and how much deviation there is
in the data values. It is an umbrella term. Variance
is a specific measure of variation. It specifically refers
to the square of the standard deviation. Therefore,
it is incorrect to say, “There is a lot of variance in
the data” or “The best measure of variance is ...”.


Optional section: Comparing Values from Different Data 
Sets 


The standard deviation is useful when comparing 
data values that come from different data sets. If the 
data sets have different means and standard 
deviations, then comparing the data values directly 
can be misleading. 


* For each data value, calculate how many
standard deviations away from its mean the
value is.

* Use the formula: value = mean +
(#ofSTDEVs)(standard deviation); solve for
#ofSTDEVs.

* #ofSTDEVs = (value - mean) / (standard deviation)

* Compare the results of this calculation.


#ofSTDEVs is often called a "z-score"; we can use
the symbol z. In symbols, the formulas become:

Sample: \( x = \bar{x} + zs \) and \( z = \frac{x - \bar{x}}{s} \)
Population: \( x = \mu + z\sigma \) and \( z = \frac{x - \mu}{\sigma} \)


Two students, John and Ali, from different
high schools, wanted to find out who had the
higher GPA when compared to his school.
Which student had the higher GPA when
compared to his school?


Student   GPA    School Mean GPA   School Standard Deviation
John      2.85   3.0               0.7
Ali       77     80                10


For each student, determine how many 
standard deviations (#ofSTDEVs) his GPA is 
away from the average, for his school. Pay 
careful attention to signs when comparing and 
interpreting the answer. 


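\[ z = \#\text{ofSTDEVs} = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma} \]


For John, \( z = \#\text{ofSTDEVs} = \frac{2.85 - 3.0}{0.7} = -0.21 \)


For Ali, \( z = \#\text{ofSTDEVs} = \frac{77 - 80}{10} = -0.3 \)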


John has the better GPA when compared to his 
school because his GPA is 0.21 standard 
deviations below his school's mean while Ali's 
GPA is 0.3 standard deviations below his 
school's mean. 


John's z-score of -0.21 is higher than Ali's
z-score of -0.3. For GPA, higher values are
better, so we conclude that John has the better
GPA when compared to his school.
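The comparison can be sketched in Python using the GPA, school mean and school standard deviation from the calculations above. The function name `z_score` is an illustrative choice.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    if std_dev <= 0:
        raise ValueError("the standard deviation must be positive")
    return (value - mean) / std_dev


john = z_score(2.85, 3.0, 0.7)
ali = z_score(77, 80, 10)

print(round(john, 2), round(ali, 2))  # -0.21 -0.3
print("John" if john > ali else "Ali")  # John
```

Because the two schools use different GPA scales, the raw values 2.85 and 77 cannot be compared directly; the z-scores put both on the same scale of standard deviations from the school mean.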


Try It 


Two swimmers, Angie and Beth, from different 
teams, wanted to find out who had the fastest 
time for the 50 meter freestyle when compared 
to her team. Which swimmer had the fastest 
time when compared to her team? 


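Swimmer   Time (seconds)   Team Mean Time   Team Standard Deviation
Angie     26.2             27.2             0.8
Beth      27.3             30.1             1.4


For Angie: \( z = \frac{26.2 - 27.2}{0.8} = -1.25 \)


For Beth: \( z = \frac{27.3 - 30.1}{1.4} = -2 \)


Beth's z-score of -2 is lower than Angie's z-score
of -1.25. For swimming, a lower time is better, so
Beth had the faster time when compared to her team.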


Distributions 


Now that we have learned about determining shape 
(histogram), centre (mean, median or mode), and 
variation (standard deviation, coefficient of 
variation and range), we can now describe the 
distribution of a data set. 


In [link], we examined the salaries for two different 
companies. 


Though we have not done the histogram for either
of these data sets, we can imagine what they will
look like to determine the shape. Company A will
have one peak at $30,000 with an outlier at
$490,000. This will make it skewed to the right. For
Company B each data value has the same frequency,
which makes the data uniform.


For company A, we would describe the distribution
of salaries to be skewed to the right (shape), centred
at $49,166.67 (mean) and have variation of
$91,920.10 (standard deviation).


For company B, we would describe the distribution
of salaries to be uniform (shape), centred at
$260,000 (mean) and have variation of $138,443.73
(standard deviation).


Chapter Review


The mean and the median can be calculated to help
you find the "center" of a data set. The mean is the
best estimate for the actual data set, but the median
is the best measurement when a data set contains
several outliers or extreme values. The mode will
tell you the most frequently occurring datum (or
data) in your data set. The mean, median, and mode
are extremely helpful when you need to analyze 
your data, but if your data set consists of ranges 
which lack specific values, the mean may seem 
impossible to calculate. However, the mean can be 
approximated if you add the lower boundary with 
the upper boundary and divide by two to find the 
midpoint of each interval. Multiply each midpoint 
by the number of values found in the corresponding 
range. Divide the sum of these values by the total 
number of data values in the set. 
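
The midpoint method just described can be sketched as
follows. The three intervals and their frequencies are
hypothetical values chosen only for illustration; they
do not come from any data set in this chapter.

```python
# Hypothetical grouped data: (lower boundary, upper boundary, frequency).
intervals = [(0, 10, 3), (10, 20, 5), (20, 30, 2)]

total = 0.0
count = 0
for lower, upper, frequency in intervals:
    midpoint = (lower + upper) / 2      # midpoint of the interval
    total += midpoint * frequency       # midpoint times number of values
    count += frequency                  # running total of data values

approx_mean = total / count
print(f"Approximate mean: {approx_mean:.2f}")
```

The result is only an approximation, because every
value in an interval is treated as if it were equal to
the midpoint of that interval.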


The standard deviation can help you measure the
spread of data. There are different equations to use
depending on whether you are calculating the
standard deviation of a sample or of a population.


* The standard deviation allows us to compare
individual data or classes to the data set mean
numerically.

* $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ or
$s = \sqrt{\frac{\sum f (x - \bar{x})^2}{n - 1}}$
is the formula for calculating the standard
deviation of a sample. To calculate the
standard deviation of a population, we would
use the population mean, $\mu$, and the formula
$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$ or
$\sigma = \sqrt{\frac{\sum f (x - \mu)^2}{N}}$.


Use the following information to answer the next three 
exercises: The following data show the lengths of 
boats moored in a marina. The data are ordered 
from smallest to largest: 
16; 17; 19; 20; 20; 21; 23; 24; 25; 25; 25; 26; 26;
27; 27; 27; 28; 29; 30; 32; 33; 33; 34; 35; 37; 39; 40


Calculate the mean. 


Mean: 16 + 17 + 19 + 20 + 20 + 21 + 23 + 24 + 25
+ 25 + 25 + 26 + 26 + 27 + 27 + 27 + 28 + 29 + 30
+ 32 + 33 + 33 + 34 + 35 + 37 + 39 + 40 = 738;


738/27 = 27.33
Identify the median. 
Median = 27 
Identify the mode. 


The most frequent lengths are 25 and 27, which 
occur three times. Mode = 25, 27 


Use the following information to answer the next three 
exercises: Sixty-five randomly selected car 
salespersons were asked the number of cars they 
generally sell in one week. Fourteen people 
answered that they generally sell three cars; 
nineteen generally sell four cars; twelve generally 
sell five cars; nine generally sell six cars; eleven 
generally sell seven cars. Calculate the following: 


sample mean = x̄ =


Mean = (14*3 + 19*4 + 12*5 + 9*6 + 11*7)/65
= 4.75


median = 


mode = 


Mode = 4 (occurs 19 times) 
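
As a check on the answers above, the three measures
can be computed directly from the frequencies. This is
a minimal sketch; the dictionary name sales and the
other variable names are assumptions of the sketch,
while the counts are the ones given in the exercise.

```python
import statistics

# Cars sold per week -> number of salespersons giving that answer.
sales = {3: 14, 4: 19, 5: 12, 6: 9, 7: 11}

n = sum(sales.values())

# Weighted mean: each value is multiplied by its frequency.
mean = sum(cars * count for cars, count in sales.items()) / n

# Expand the table into the 65 individual answers to find the median.
expanded = [cars for cars, count in sales.items() for _ in range(count)]
median = statistics.median(expanded)

# The mode is the value with the largest frequency.
mode = max(sales, key=sales.get)

print(f"n = {n}, mean = {mean:.2f}, median = {median}, mode = {mode}")
```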


The following data are the distances between 
20 retail stores and a large distribution center. 
The distances are in miles. 


29; 37; 38; 40; 58; 67; 68; 69; 76; 86; 87; 95; 
96; 96; 99; 106; 112; 127; 145; 150 

Use a computer to find the standard deviation 
and round to the nearest tenth. 


s = 34.5 
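

One way to "use a computer" here is Python's standard
statistics module. The sketch below assumes the 20
distances are a sample, so it uses stdev (which divides
by n - 1); pstdev (which divides by N) is shown only
for contrast.

```python
import statistics

# Distances in miles between 20 retail stores and the distribution center.
distances = [29, 37, 38, 40, 58, 67, 68, 69, 76, 86, 87, 95,
             96, 96, 99, 106, 112, 127, 145, 150]

s = statistics.stdev(distances)       # sample standard deviation
sigma = statistics.pstdev(distances)  # population standard deviation

print(f"sample standard deviation s = {s:.1f}")
print(f"population standard deviation = {sigma:.1f}")
```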


Bringing It Together 


Javier and Ercilia are supervisors at a shopping 
mall. Each was given the task of estimating the 
mean distance that shoppers live from the mall. 
They each randomly surveyed 100 shoppers. 

The samples yielded the following information.


[Table: sample mean and sample standard deviation
for Javier's and Ercilia's samples]


1. How can you determine which survey was
correct?

2. Explain what the difference in the results
of the surveys implies about the data.

3. If the two histograms depict the
distribution of values for each supervisor,
which one depicts Ercilia's sample? How
do you know?


(a) (b) 


1. It is difficult to determine which survey is
correct. Both surveys include the same
number of shoppers and the shoppers were
randomly selected. We could look at how
the random selection was done to see if
one of the sampling techniques would
result in a more representative sample. But
if they used the same sampling technique,
there is no way to know which sample is
right. The only way would be to take
another, larger sample and see which of
the two supervisors' samples most closely
matches that sample. But really we expect
there to be sampling variability so it is not
really an appropriate question to ask
which is "correct".

2. Since the mean is the same for both
samples, this suggests that it is fair to say
that on average shoppers travel 6.0 km to
the mall. But the standard deviations are
different. This suggests that it is not yet
clear how much variation there is from the
6.0 km.

3. Ercilia's data has a larger standard
deviation. Therefore, on average, the data
needs to be more spread out from the
mean than Javier's. This suggests (b) is the
answer.


Use the following information to answer the next three 
exercises: We are interested in the number of years 
students in a particular elementary statistics class 
have lived in California. The information in the 
following table is from the entire section. 


Number of years   Frequency   Number of years   Frequency
7                 1           99                1
Total = 20


What is the mode?


1. 19
2. 19.5
3. 14 and 20
4. 22.65


Mode = 19 (occurs 4 times) 


Is this a sample or the entire population? 


1. sample 
2. entire population 
3. neither 


A survey of enrollment at 35 community 
colleges across the United States yielded the 
following figures: 


6414; 1550; 2109; 9350; 21828; 4300; 5944; 
9722; 2825; 2044; 5481; 5200; 5853; 2750; 
10012; 6357; 27000; 9414; 7681; 3200; 17500; 
9200; 7380; 18314; 6557; 13713; 17768; 7493; 
2771; 2861; 1263; 7285; 28165; 5080; 11622 


1. Organize the data into a chart with six
intervals of equal width. Label the two
columns "Enrollment" and "Frequency."

2. Construct a histogram of the data.

3. What is the shape of the data? What does
the shape tell you about the enrollment at
these community colleges?

4. What is the best measure of centre for this
data and why? State the measure.

5. What is the best measure of variation for
this data and why? State the measure.

6. If you were to build a new community
college, what is the typical range for the
enrollment? Why would this information
be helpful? What caveats would you want
to think about when you look at this
typical range?


2. Histogram for enrollment at community
colleges.


[Figure: histogram of enrollment at community
colleges]


3. The shape is skewed to the right, which
means that there are a few community colleges
that have greater enrollment compared to
most of the other colleges in the sample.

4. Since the mean (8,628.74) is being skewed
(as it is larger than the median of 6,414),
the median is the best measure of centre.

5. Since we are only looking at one data set,
the standard deviation is a good measure
of variation. It is 6,943.88.

6. The typical range is 6,414 +/- 6,943.88 =
-529.88 to 13,357.88. As there can't be
negative students enrolled, the typical
range is 0 students to 13,357.88. Though
there could be multiple caveats, one
concern is the rather large variation in the
data. This means that community colleges
have very different enrollment rates.
Perhaps looking at community colleges
that are similar to the one I would like to
open would be more beneficial as that
population would be more representative
of my community college.


You work for a soda pop company that is 
producing a new label for their Asian market. 
Three different labels your company is 
considering are the same, except the colours are 
different. The colour choices are blue, green 
and orange. 


To determine which label consumers prefer, 
focus groups were done. One such focus group 
asked 15 participants to rate the cans from 1 to 
10. A score of 1 means they hated the label and 
10 means they loved the label. The results 
follow. 


Participant   Blue Label   Green Label   Orange Label


Which label would you recommend as the new
label for the Asian market? Support your
decision using the data.


Label 1 is excluded as most people don’t like it. 
The mean for label 2 and label 3 is the same. 
Label 2 could be considered the better label 
because more people love it than label 3, but 
more people hate it. Label 3 could be 
considered a better label because the variation 
is less - nobody hates it, but nobody loves it. 
(Note: Even though you are comparing two 
data sets, it is ok to look only at the standard 
deviation instead of the coefficient of variation 
in this situation. Why?). 


Choosing label 2 has greater risk (love/hate 
relationship). Choosing label 3 has less risk 
(most people like it). 


Three publicly traded telecommunications 
companies reported their monthly profit for the 
last year. The results are presented below. 


                     Company A   Company B   Company C
Standard deviation   $4,196      $9,360      $4,116
Range                $15,050     $42,150     $16,400


1. Donna is close to retirement and wants to 
invest in one of the three companies. She 
doesn’t want to see her investment drop 
significantly as she doesn’t want to see her 
retirement savings dwindle. Which 
company would you recommend she invest 
in and why? 

2. What information is missing from the list 
that you might want to have to help you 
answer the above question? 

3. What information below is not necessary 
for making this decision? 


Note that this question is about risk, i.e. 
variation. 


1. Any answer requires that you examine the
amount of variation in the data set. The
coefficient of variation is the best measure
to use to compare the variation as the

                           Company A   Company B     Company C
Coefficient of variation   38.39%      [illegible]   [illegible]

2. The information provided is only for one 
year. It would be helpful to know about 
their changes over more than one year. 
Quartiles aren’t provided. They could help 
examine the variation as well. 

3. The median and the mode are not relevant 
as this is a question about variation. The 
mean is only required as it is needed to 
find the coefficient of variation. 


Glossary 


Frequency Table
a data representation in which grouped data
is displayed along with the corresponding
frequencies


Mean
a number that measures the central tendency
of the data; a common name for mean is
'average.' The term 'mean' is a shortened form
of 'arithmetic mean.' By definition, the mean
for a sample (denoted by x̄) is x̄ =
(Sum of all values in the sample)/(Number of
values in the sample), and the mean for a
population (denoted by μ) is μ =
(Sum of all values in the population)/(Number
of values in the population).


Median
a number that separates ordered data into
halves; half the values are the same number
or smaller than the median and half the
values are the same number or larger than the
median. The median may or may not be part
of the data.


Midpoint
the mean of an interval in a frequency table


Mode
the value that appears most frequently in a
set of data


Measures of Location and Box Plots -- MRU -- C 
Lemieux (2017) 


Introduction 


Measures of location help us to understand where
data values are located relative to other data values.
We've already seen a measure of location: the
median. It tells us what data value is in the middle
of the data set. The most common measure of
position is a percentile. Percentiles divide ordered
data into hundredths. To score in the 90th percentile
of an exam does not mean, necessarily, that you
received 90% on a test. It means that 90% of test
scores are the same or less than your score and 10%
of the test scores are the same or greater than your
test score. The median is the 50th percentile.


Quartiles are a special type of percentile.
Quartiles divide ordered data into quarters. The first
quartile, Q1, is the same as the 25th percentile, and
the third quartile, Q3, is the same as the 75th
percentile. The median, M, is called both the second
quartile and the 50th percentile.


A visual representation of measures of location is 
called a box plot. 


In this section, we will learn how to find quartiles
and use those quartiles to find the interquartile
range and outliers. Then we will visually represent
this information on a box plot. Unlike histograms
and bar graphs, box plots require the use of
numerical summaries. Thus the box plot is a
representation that combines both visual and
numerical summaries of the data.


Measures of location 


As described in the introduction, percentiles are a
common measure of location. Percentiles are useful for
comparing values. For this reason, universities and
colleges use percentiles extensively. One instance in
which colleges and universities use percentiles is
when SAT results are used to determine a minimum
testing score that will be used as an acceptance
factor. For example, suppose Duke accepts SAT
scores at or above the 75th percentile. That
translates into a score of at least 1220.


Percentiles are mostly used with very large 
populations. Therefore, if you were to say that 90% 
of the test scores are less (and not the same or less) 
than your score, it would be acceptable because 
removing one particular data value is not 
significant. 


The median is a number that measures the "center"
of the data. You can think of the median as the
"middle value," but it does not actually have to be
one of the observed values. It is a number that
separates ordered data into halves. Half the values
are the same number or smaller than the median,
and half the values are the same number or larger.
For example, consider the following data.

1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1
Ordered from smallest to largest:

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5


Since there are 14 observations, the median is
between the seventh value, 6.8, and the eighth
value, 7.2. To find the median, add the two values
together and divide by two.

(6.8 + 7.2)/2 = 7


The median is seven. Half of the values are smaller 
than seven and half of the values are larger than 
seven. 


Quartiles are numbers that separate the data into
quarters. Quartiles may or may not be part of the
data. To find the quartiles, first find the median or
second quartile. The first quartile, Q1, is the middle
value of the lower half of the data, and the third
quartile, Q3, is the middle value, or median, of the
upper half of the data. To get the idea, consider the
same data set:

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5




The median or second quartile is seven. The lower 
half of the data are 1, 1, 2, 2, 4, 6, 6.8. The middle 
value of the lower half is two. 

1; 1; 2; 2; 4; 6; 6.8 


The number two, which is part of the data, is the
first quartile. One-fourth of the entire set of values
are the same as or less than two and three-fourths of
the values are more than two.


The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 
11.5. The middle value of the upper half is nine. 


The third quartile, Q3, is nine. Three-fourths (75%) 
of the ordered data set are less than nine. One- 
fourth (25%) of the ordered data set are greater 
than nine. The third quartile is part of the data set 
in this example. 
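

The "median of each half" procedure used above can be
sketched in Python. The function name quartiles is an
assumption of this sketch. It follows the method of
this section (the median itself is left out of both
halves when the number of values is odd); statistical
software often uses other quartile rules and may give
slightly different answers.

```python
import statistics


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # lower half of the ordered data
    upper = ordered[-half:]   # upper half of the ordered data
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]
q1, m, q3 = quartiles(data)
print(f"Q1 = {q1}, M = {m}, Q3 = {q3}")
```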


Possible Quartile Positions
[Figure: ordered data split into four groups:
observations less than Q1; observations less than Q2
but greater than Q1; observations less than Q3 but
greater than Q2; observations greater than Q3]


As mentioned in the previous section, the 
interquartile range is a measure of variation. It is a 
number that indicates the spread of the middle half 
or the middle 50% of the data. It is the difference 
between the third quartile (Q3) and the first quartile 
(Q1). 


IQR = Q3 - Q1


The IQR can help to determine potential outliers. A
value is suspected to be a potential outlier if it is
more than (1.5)(IQR) below the first quartile or
more than (1.5)(IQR) above the third quartile.
Potential outliers always require further
investigation.


NOTE
A potential outlier is a data point that is
significantly different from the other data points.
These special data points may be errors or some
kind of abnormality or they may be a key to
understanding the data.


For the following 13 real estate prices, 
calculate the IQR and determine if any prices 
are potential outliers. Prices are in dollars. 


389,950; 230,500; 158,000; 479,000; 639,000; 
114,950; 5,500,000; 387,000; 659,000; 
529,000; 575,000; 488,800; 1,095,000 


Order the data from smallest to largest. 
114,950; 158,000; 230,500; 387,000; 389,950; 
479,000; 488,800; 529,000; 575,000; 639,000; 
659,000; 1,095,000; 5,500,000 


M = 488,800 


Q1 = (230,500 + 387,000)/2 = 308,750


Q3 = (639,000 + 659,000)/2 = 649,000
IQR = 649,000 - 308,750 = 340,250
(1.5)(IQR) = (1.5)(340,250) = 510,375


1.5(IQR) less than the first quartile: Q1 - (1.5)
(IQR) = 308,750 - 510,375 = -201,625


1.5(IQR) more than the third quartile: Q3 +
(1.5)(IQR) = 649,000 + 510,375 =
1,159,375


No house price is less than -201,625.
However, 5,500,000 is more than 1,159,375.
Therefore, 5,500,000 is a potential outlier.
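

The same check can be sketched in Python. The helper
name quartiles is an assumption of this sketch, and it
uses the median-of-halves method from this section;
other quartile rules would move the cutoffs slightly.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return statistics.median(ordered[:half]), statistics.median(ordered[-half:])


prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]

q1, q3 = quartiles(prices)
iqr = q3 - q1
low_cutoff = q1 - 1.5 * iqr    # 1.5(IQR) below the first quartile
high_cutoff = q3 + 1.5 * iqr   # 1.5(IQR) above the third quartile

potential_outliers = [p for p in prices if p < low_cutoff or p > high_cutoff]
print(f"IQR = {iqr}, cutoffs = ({low_cutoff}, {high_cutoff})")
print(f"Potential outliers: {potential_outliers}")
```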


For the two data sets in the test scores 
example, find the following: 


1. The interquartile range. Compare the two 
interquartile ranges. 
2. Any outliers in either set. 


The five number summary for the day and
night classes is

        Minimum   Q1   Median   Q3     Maximum
Day:    32        56   74.5     82.5   99
Night:  25.5      78   81       89     98


1. The IQR for the day group is Q3 - Q1 =
82.5 - 56 = 26.5


The IQR for the night group is Q3 - Q1 =
89 - 78 = 11


The interquartile range (the spread or
variability) for the day class is larger than
the night class IQR. This suggests more
variation will be found in the day class's
test scores.

2. Day class outliers are found using the IQR
times 1.5 rule. So,


* Q1 - IQR(1.5) = 56 - 26.5(1.5) =
16.25

* Q3 + IQR(1.5) = 82.5 + 26.5(1.5)
= 122.25


Since the minimum and maximum values
for the day class are greater than 16.25
and less than 122.25, there are no
outliers.


Night class outliers are calculated as:


* Q1 - IQR(1.5) = 78 - 11(1.5) =
61.5

* Q3 + IQR(1.5) = 89 + 11(1.5) =
105.5


For this class, any test score less than 61.5
is an outlier. Therefore, the scores of 45
and 25.5 are outliers. Since no test score
is greater than 105.5, there is no upper
end outlier.


Interpreting Percentiles, Quartiles, and Median 


A percentile indicates the relative standing of a data
value when data are sorted into numerical order
from smallest to largest. In general, p percent of
data values are less than or equal to the pth
percentile. For example, 15% of data values are less
than or equal to the 15th percentile.
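

To make the "less than or equal to" reading concrete,
the sketch below counts what percent of a data set is
at or below a chosen value. The function name
percent_at_or_below is an assumption of this sketch;
the data are the 14 values used earlier in this
section, whose median is 7.

```python
def percent_at_or_below(values, cutoff):
    """Percent of the values that are less than or equal to `cutoff`."""
    at_or_below = sum(1 for v in values if v <= cutoff)
    return 100 * at_or_below / len(values)


data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]

at_median = percent_at_or_below(data, 7)
print(f"{at_median:.0f}% of the values are at or below 7")
```

With a small data set the percentages move in large
steps, which is one reason percentiles are mostly used
with large populations.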


* Low percentiles always correspond to lower 
data values. 

* High percentiles always correspond to higher 
data values. 


A percentile may or may not correspond to a value 
judgment about whether it is "good" or "bad." The 
interpretation of whether a certain percentile is 
"good" or "bad" depends on the context of the 
situation to which the data applies. In some 
situations, a low percentile would be considered 
"good;" in other contexts a high percentile might be 
considered "good". In many situations, there is no 
value judgment that applies. 


Understanding how to interpret percentiles properly 
is important not only when describing data, but also 
when calculating probabilities in later chapters of 
this text. 


Guideline

When writing the interpretation of a percentile in
the context of the given data, the sentence should
contain the following information.


* information about the context of the situation
being considered

* the data value (value of the variable) that
represents the percentile

* the percent of individuals or items with data
values below the percentile

* the percent of individuals or items with data
values above the percentile.


On a timed math test, the first quartile for time 
it took to finish the exam was 35 minutes. 
Interpret the first quartile in the context of this 
situation. 


* Twenty-five percent of students finished
the exam in 35 minutes or less.

* Seventy-five percent of students finished
the exam in 35 minutes or more.

* A low percentile could be considered
good, as finishing more quickly on a
timed exam is desirable. (If you take too
long, you might not be able to finish.)


On a 20 question math test, the 70th percentile 
for number of correct answers was 16. 
Interpret the 70th percentile in the context of 
this situation. 


* Seventy percent of students answered 16
or fewer questions correctly.

* Thirty percent of students answered 16 or
more questions correctly.

* A higher percentile could be considered
good, as answering more questions
correctly is desirable.


Try It 


On a 60 point written assignment, the 80th 
percentile for the number of points earned was 
49. Interpret the 80th percentile in the context 
of this situation. 


Eighty percent of students earned 49 points or
fewer. Twenty percent of students earned 49
or more points. A higher percentile is good
because getting more points on an assignment
is desirable.


At a community college, it was found that the
30th percentile of credit units that students are
enrolled for is seven units. Interpret the 30th
percentile in the context of this situation.


* Thirty percent of students are enrolled in
seven or fewer credit units.

* Seventy percent of students are enrolled
in seven or more credit units.

* In this example, there is no "good" or
"bad" value judgment associated with a
higher or lower percentile. Students
attend community college for varied
reasons and needs, and their course load
varies according to their needs.


Outliers 


The idea of potential outliers was discussed above.
This section will look in more depth at how to find
outliers and how to categorize them.


Quartiles can also be used to determine if there are 
any outliers in a data set. To determine if there are 
outliers, we need to first calculate the inner and 
outer fences. The fences define the boundary 
between a “normal” data value and an “abnormal” 
data value (or outlier). Any data values that fall 
between the inner fences are normal data values. 
Any data values that fall outside the inner fences 
are considered outliers. 


The fences are calculated as follows: 


The inner fences are Q1 - IQR(1.5) and Q3 + 
IQR(1.5). 


The outer fences are Q1 - IQR(3) and Q3 + IQR(3). 


A mild outlier is any data value between the inner 
and outer fences. 


An extreme outlier is any data value to the extreme 
of the outer fence. 


Finding outliers
Sharpe Middle School is applying for a grant that
will be used to add fitness equipment to the gym.
The principal surveyed 15 anonymous students to
determine how many minutes a day the students
spend exercising. The results from the 15
anonymous students are shown.

0 minutes; 40 minutes; 60 minutes; 30 minutes; 60
minutes; 10 minutes; 45 minutes; 30 minutes; 300
minutes; 90 minutes; 30 minutes; 120 minutes; 60
minutes; 0 minutes; 20 minutes

The five-number summary is determined to be: Min
= 0; Q1 = 20; Med = 40; Q3 = 60; Max = 300.

Are there any students who are exercising
significantly more or less than the other students?
To answer this question, we have to determine if
there are any outliers.

To do this, determine the inner fences.
The IQR is 60 - 20 = 40.
The lower inner fence is Q1 - IQR(1.5) = 20 -
40(1.5) = -40 and the upper inner fence is Q3 +
IQR(1.5) = 60 + 40(1.5) = 120. Thus, any
student who exercises between -40 minutes and
120 minutes is exercising a “normal” amount of
time (relative to the rest of the students). Since
someone can’t exercise -40 minutes, this is really 0
minutes to 120 minutes. Therefore, 300 minutes
appears to be an outlier. But is it a mild outlier or
an extreme outlier?

To determine if it is mild or extreme, we need to
calculate the outer fence. We only need the upper
outer fence as there are no low outliers (no one
exercised for less than -40 minutes). The upper
outer fence is Q3 + IQR(3) = 60 + 40(3) = 180.


If the potential outlier is between 120 and 180 
minutes, then it is a mild outlier (as it is between 
the upper inner and outer fences). If it is more than 
180 minutes, then it is an extreme outlier. In this 
case, 300 minutes is an extreme outlier. This means 
that this student is exercising way more than the 
rest of their classmates! 
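

The fence rules can be sketched as a small Python
function. The name classify and its return labels are
assumptions of this sketch; Q1 = 20 and Q3 = 60 are
taken from the five-number summary given in the
example rather than recomputed.

```python
def classify(value, q1, q3):
    """Label a value as normal, a mild outlier, or an extreme outlier."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # inner fences
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr      # outer fences
    if value < outer_low or value > outer_high:
        return "extreme outlier"
    if value < inner_low or value > inner_high:
        return "mild outlier"
    return "normal"


minutes = [0, 40, 60, 30, 60, 10, 45, 30, 300, 90, 30, 120, 60, 0, 20]

for m in sorted(set(minutes)):
    label = classify(m, q1=20, q3=60)
    if label != "normal":
        print(f"{m} minutes: {label}")
```

A value exactly on a fence, such as 120 minutes here,
is treated as normal in this sketch because it does not
fall outside the inner fences.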


Box Plots 


Box plots (also called box-and-whisker plots or 
box-whisker plots) give a good graphical image of 
the concentration of the data. They also show how 
far the extreme values are from most of the data. 


To construct a box plot, use a horizontal or vertical 
number line and a rectangular box. The smallest and 
largest data values label the endpoints of the axis. 
The first quartile marks one end of the box and the 
third quartile marks the other end of the box. 
Approximately the middle 50 percent of the data 
fall inside the box. The "whiskers" extend from the 
ends of the box to the smallest and largest data 
values. The median or second quartile can be 
between the first and third quartiles, or it can be 
one, or the other, or both. The box plot gives a 
good, quick picture of the data. 


A box plot is constructed from the five-number 
summary (the minimum value, the first quartile, the 
median, the third quartile, and the maximum value) 
and, if there are outliers, the fences. We use these 
values to compare how close other data values are 
to them. 

Example of a box plot 

This is an example of a box plot. The box is in the 
middle and represents 50% of the data. The circles 
on the right represent outliers and the dashed lines 
the fences. The outliers at approximately 22000 and 
27000 are mild outliers, while the outlier at 
approximately 28500 is an extreme outlier. 


[Figure: box plot with three outliers marked on the
right; horizontal axis "Data" from 0 to 30000]


To construct a box plot, use a horizontal or vertical 
number line and a rectangular box. The smallest and 
largest data values label the endpoints of the axis. 
The first quartile marks one end of the box and the 
third quartile marks the other end of the box. The 
median is represented by a line inside the box. The 
middle 50 percent of the data fall inside the box and 
the length of the box is the interquartile range. 


The "whiskers" extend from the ends of the box to
the first data values inside the fences. If there are no
outliers, this would be minimum and maximum
values. The outliers are represented by asterisks or
dots and fall either between the inner and outer
fences (mild outlier) or outside the outer fences
(extreme outlier).


Consider, again, this dataset.
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5


From the work done above, we know the five 
number summary is 1, 2, 7, 9, 11.5. The IQR is 9-2 
= 7. IQR(1.5) is 7*1.5 = 10.5. The lower inner 
fence is Q1-IQR(1.5) = 2-10.5=-8.5 and the upper 
inner fence is Q3+IQR(1.5)=9+10.5 = 19.5. Since 
no data values are smaller than -8.5 or larger than 
19.5, there are no outliers in the data set. 


The two whiskers extend from the first quartile to 
the smallest value and from the third quartile to the 
largest value. The median is shown with a dashed 
line. 


NOTE
It is important to start a box plot with a scaled
number line. Otherwise the box plot may not be
useful.


The following data are the heights of 40 students 
(in inches) in a statistics class. 

59; 60; 61; 62; 62; 63; 63; 64; 64; 64; 65; 65; 65;
65; 65; 65; 65; 65; 65; 66; 66; 67; 67; 68; 68; 69;
70; 70; 70; 70; 70; 71; 71; 72; 72; 73; 74; 74; 75;
77

Take this data and input it into Excel. Use the "Text 
to Columns" function in the "Data" menu to 
separate the data into separate columns. Then copy 
the data, but when you paste it, use paste special to 
"Transpose" the data so it is all in one column. 

Now use whatever software you are using to find
the five-number summary.


* Minimum value = 59

* Q1: First quartile = 64.75

* Q2: Second quartile or median = 66

* Q3: Third quartile = 70

* Maximum value = 77


Are there outliers? The IQR is 70 - 64.75 = 5.25.
IQR(1.5) = 7.875 (don't round until the end).
The lower inner fence is Q1 - IQR(1.5) =
64.75 - 7.875 = 56.875. Since the minimum value is
59, there are no lower outliers.
The upper inner fence is Q3 + IQR(1.5) =
70 + 7.875 = 77.875. Since the maximum value is
77, there are no upper outliers.
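

For readers working in Python rather than Excel, the
sketch below reproduces this summary with the standard
statistics module. Two things are assumptions of the
sketch: the list of 40 heights (its last 14 values are
a reconstruction of a damaged line, chosen to agree
with the summary above) and the choice of
method="inclusive", which interpolates between data
values and happens to match Q1 = 64.75. The
median-of-halves method from earlier in this section
gives 64.5 instead, so different tools can report
slightly different quartiles.

```python
import statistics

heights = [59, 60, 61, 62, 62, 63, 63, 64, 64, 64, 65, 65, 65,
           65, 65, 65, 65, 65, 65, 66, 66, 67, 67, 68, 68, 69,
           70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 77]

# Interpolating quartiles (three cut points for n=4 groups).
q1, q2, q3 = statistics.quantiles(heights, n=4, method="inclusive")

# Median-of-halves first quartile, for comparison.
halves_q1 = statistics.median(sorted(heights)[:len(heights) // 2])

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"Min = {min(heights)}, Q1 = {q1}, Med = {q2}, Q3 = {q3}, Max = {max(heights)}")
print(f"Median-of-halves Q1 = {halves_q1}")
print(f"Inner fences: {lower_fence} to {upper_fence}")
```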

You can also use your computer program to create 
a box plot for the data. 

Box plot of height of 40 students 


[Figure: box plot of heights of students in
statistics class]


The titles and labels for a box plot follow the same
rules as they do for a histogram or a bar graph.


What does the box plot tell us? 


* Each quarter has approximately 25% of the
data.

* The spreads of the four quarters are 64.75 -
59 = 5.75 (first quarter), 66 - 64.75 = 1.25
(second quarter), 70 - 66 = 4 (third quarter),
and 77 - 70 = 7 (fourth quarter). So, the
second quarter has the smallest spread and the
fourth quarter has the largest spread.

* Range = maximum value - the minimum
value = 77 - 59 = 18, which means that
from the shortest to the tallest student there is
a difference of 18 inches.

* Interquartile Range: IQR = third quartile -
first quartile = 70 - 64.75 = 5.25, which
means that the middle 50% (middle half) of
the data has a range of 5.25 inches. This also
means the length of the box is 5.25.
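

The readings in this list follow directly from the
five-number summary, as the short sketch below shows.
The variable names are assumptions of the sketch; the
five numbers are the ones found for the heights data.

```python
# Five-number summary of the heights data (inches).
minimum, q1, median, q3, maximum = 59, 64.75, 66, 70, 77

# Spread of each quarter of the data.
spreads = {
    "first quarter": q1 - minimum,
    "second quarter": median - q1,
    "third quarter": q3 - median,
    "fourth quarter": maximum - q3,
}

narrowest = min(spreads, key=spreads.get)
widest = max(spreads, key=spreads.get)
data_range = maximum - minimum
iqr = q3 - q1  # also the length of the box

for name, spread in spreads.items():
    print(f"{name}: {spread}")
print(f"smallest spread: {narrowest}; largest spread: {widest}")
print(f"range = {data_range}, IQR = {iqr}")
```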


Try It 


The following data are the number of pages in
40 books on a shelf. Construct a box plot using
computer software, and state the interquartile
range.


136 140 178 190 205 215 217 218 232 234
240 255 270 275 290 301 303 315 317 318
326 333 343 349 360 369 377 388 391 392
398 400 402 405 408 422 429 450 475 512


[Figure: box plot of the number of pages; number
line from 120 to 540]


IQR = 158 


For some sets of data, some of the largest value,
smallest value, first quartile, median, and third
quartile may be the same. For instance, you might
have a data set in which the median and the third
quartile are the same. In this case, the diagram
would not have a dotted line inside the box
displaying the median. The right side of the box
would display both the third quartile and the
median. For example, if the smallest value and the
first quartile were both one, the median and the
third quartile were both five, and the largest value
was seven, the box plot would look like:


[Figure: box plot on a number line from 1 to 7]


In this case, at least 25% of the values are equal to 
one. Twenty-five percent of the values are between 
one and five, inclusive. At least 25% of the values 
are equal to five. The top 25% of the values fall 
between five and seven, inclusive. 


Test scores for a college statistics class held during 
the day are: 
99 56 78 55.5 32 90 80 81 56 59 45 77 84.5 84 70
72 68 32 79 90

Test scores for a college statistics class held during 
the evening are: 

98 78 68 83 81 89 88 76 65 45 98 90 80 84.5 85


1. Find the smallest and largest values, the
median, and the first and third quartile 
for the day class. 

2. Find the smallest and largest values, the 
median, and the first and third quartile 
for the night class. 

3. For each data set, what percentage of the 
data is between the smallest value and the 
first quartile? the first quartile and the 
median? the median and the third 
quartile? the third quartile and the largest 
value? What percentage of the data is 
between the first quartile and the largest 
value? 

4. Create a box plot for each set of data. Use 
one number line for both box plots. 

5. Which box plot has the widest spread for 
the middle 50% of the data (the data 
between the first and third quartiles)? 
What does this mean for that set of data 
in comparison to the other set of data? 


1. * Min = 32
* Q1 = 56
* M = 74.5
* Q3 = 82.5
* Max = 99

3. Day class: There are six data values
ranging from 32 to 56: 30%. There are six
data values ranging from 56 to 74.5: 30%.
There are five data values ranging from
74.5 to 82.5: 25%. There are five data
values ranging from 82.5 to 99: 25%.
There are 16 data values between the first
quartile, 56, and the largest value, 99:
75%. Night class:

4. [Figure: box plots for the two classes on one number line from 20 to 100]

5. The first data set has the wider spread for
the middle 50% of the data. The IQR for
the first data set is greater than the IQR
for the second set. This means that there
is more variability in the middle 50% of
the first data set.


Try It 


The following data set shows the heights in 
inches for the boys in a class of 40 students. 


66; 66; 67; 67; 68; 68; 68; 68; 68; 69; 69; 69; 
70; 71; 72; 72; 72; 73; 73; 74

The following data set shows the heights in 
inches for the girls in a class of 40 students. 
61; 61; 62; 62; 63; 63; 63; 65; 65; 65; 66; 66;
66; 67; 68; 68; 68; 69; 69; 69 

Construct a box plot using computer software 
for each data set, and state which box plot has 
the wider spread for the middle 50% of the 
data. 


[Figure: box plots of the heights of boys and the heights of girls, on one number line from 60 to 76]


IQR for the boys = 4 
IQR for the girls = 5 


The box plot for the heights of the girls has the 
wider spread for the middle 50% of the data. 

Chapter Review


The values that divide a rank-ordered set of data 
into 100 equal parts are called percentiles. 


Percentiles are used to compare and interpret data.
For example, an observation at the 50th percentile
would be greater than 50 percent of the other
observations in the set. Quartiles divide data into
quarters. The first quartile (Q1) is the 25th
percentile, the second quartile (Q2 or median) is the 50th
percentile, and the third quartile (Q3) is the 75th
percentile. The interquartile range, or IQR, is the
range of the middle 50 percent of the data values.
The IQR is found by subtracting Q1 from Q3, and can
help determine outliers by using the following two
expressions.


* Q3 + IQR(1.5)
* Q1 - IQR(1.5)
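
As an illustration, the two expressions can be evaluated for the day-class test scores from earlier in this section, where the first quartile is 56 and the third quartile is 82.5. This is only a sketch: the function name `outlier_fences` is hypothetical, and treating any value outside the two results as a potential outlier is the usual reading of these expressions.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Day-class test scores and their quartiles, as given in the example above.
day_scores = [99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59,
              45, 77, 84.5, 84, 70, 72, 68, 32, 79, 90]
lower, upper = outlier_fences(56, 82.5)

# Values below the lower result or above the upper result are flagged.
flagged = [x for x in day_scores if x < lower or x > upper]
print(f"lower = {lower}, upper = {upper}, flagged = {flagged}")
```

Every day-class score lies between the two results, so none of them would be flagged as a potential outlier.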


Box plots are a type of graph that can help visually 
organize data. To graph a box plot the following 
data points must be calculated: the minimum value, 
the first quartile, the median, the third quartile, and 
the maximum value. Once the box plot is graphed, 
you can display and compare distributions of data. 


On an exam, would it be more desirable to earn 
a grade with a high or low percentile? Explain. 


It is better to earn a grade in a high percentile 
as that means that you have done better on the 
exam relative to your classmates. 


Mina is waiting in line at the Department of 
Motor Vehicles (DMV). Her wait time of 32 
minutes is the 85th percentile of wait times. Is 
that good or bad? Write a sentence interpreting 
the 85th percentile in the context of this 
situation. 


When waiting in line at the DMV, the 85th 
percentile would be a long wait time compared 
to the other people waiting. 85% of people had 
shorter wait times than Mina. In this context, 
Mina would prefer a wait time corresponding to 
a lower percentile. 85% of people at the DMV 
waited 32 minutes or less. 15% of people at the 
DMV waited 32 minutes or longer. 


In a study collecting data about the repair costs 
of damage to automobiles in a certain type of
crash test, a certain model of car had $1,700 in
damage and was in the 90th percentile. Should 
the manufacturer and the consumer be pleased 
or upset by this result? Explain and write a 
sentence that interprets the 90th percentile in 
the context of this problem. 


The manufacturer and the consumer would be
upset. This is a large repair cost for the
damages, compared to the other cars in the
sample. INTERPRETATION: 90% of the crash-tested
cars had damage repair costs of $1700 or
less; only 10% had damage repair costs of
$1700 or more.


Suppose that you are buying a house. You and 
your realtor have determined that the most 
expensive house you can afford is the 34th 
percentile. The 34th percentile of housing prices 
is $240,000 in the town you want to move to. 
In this town, can you afford 34% of the houses 
or 66% of the houses? 


You can afford 34% of houses. 66% of the 
houses are too expensive for your budget. 
INTERPRETATION: 34% of houses cost 
$240,000 or less. 66% of houses cost $240,000 
or more. 


Sixty-five randomly selected car salespersons 
were asked the number of cars they generally 
sell in one week. Fourteen people answered that 
they generally sell three cars; nineteen 
generally sell four cars; twelve generally sell 
five cars; nine generally sell six cars; eleven 
generally sell seven cars. Construct a box plot 
for this data. 


[Figure: box plot of the number of cars sold per salesperson, on a number line from 2 to 8]


Looking at your box plot in the exercise above, 
does it appear that the data are concentrated 
together, spread out evenly, or concentrated in 
some areas, but not in others? How can you 
tell? 


More than 25% of salespersons sell four cars in 
a typical week. You can see this concentration 
in the box plot because the first quartile is 
equal to the median. The top 25% and the 
bottom 25% are spread out evenly; the whiskers 
have the same length. 
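
The five values behind this box plot can be recovered from the frequency counts in the exercise. The sketch below expands the counts into a full list of 65 responses and, as an assumption, takes the quartiles as the medians of the lower and upper halves (leaving out the middle value, since the count is odd); the variable names are illustrative only.

```python
from statistics import median

# Cars generally sold per week -> number of salespersons giving that answer.
frequency = {3: 14, 4: 19, 5: 12, 6: 9, 7: 11}

# Expand the frequency table into one sorted value per salesperson.
data = sorted(cars for cars, count in frequency.items() for _ in range(count))

half = len(data) // 2            # 65 responses, so half = 32
lower = data[:half]              # the 32 smallest responses
upper = data[half + 1:]          # the 32 largest responses (middle one left out)

summary = (min(data), median(lower), median(data), median(upper), max(data))
print("min, Q1, median, Q3, max =", summary)
```

The first quartile and the median come out equal, and both whiskers span one car, which is what the answer above reads off the plot.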


In a survey of 20-year-olds in China, Germany, 
and the United States, people were asked the 
number of foreign countries they had visited in 
their lifetime. The following box plots display 
the results. 


[Figure: three box plots, labeled China, Germany, and United States]


1. In complete sentences, describe what the 
shape of each box plot implies about the 
distribution of the data collected. 

2. Have more Americans or more Germans 
surveyed been to over eight foreign 
countries? 

3. Compare the three box plots. What do they 
imply about the foreign travel of 20-year- 
old residents of the three countries when 
compared to each other? 


1. The shape of the box plot for China suggests
that every person surveyed except one
visited either 0 foreign countries or 5
foreign countries. For example, if 30
people were interviewed in China, either 29 of
them have visited no foreign country and
one of them has visited 5 foreign countries,
or 29 of them have visited 5 foreign
countries and one of them has visited no
foreign countries. The box plot does not show
which of these is the case. In Germany,
50% of those surveyed have visited 8 or
fewer countries. Based on the position of the
median, this suggests that there are many
people in the survey who have visited
eight countries. This suggests the
distribution will have a peak at 8 and will
be non-symmetric. In the USA, 50% of
those surveyed have visited 2 or fewer
countries. As there are no whiskers, this
suggests that 25% of the Americans
surveyed have visited no foreign countries,
which suggests a skew to the right for the
distribution.

2. 25% of Germans surveyed have been to
more than 8 foreign countries. It is unclear
what the percentage is for Americans, but it
is less than 25%. Therefore, more Germans.

3. Germans in the survey have visited far
more countries than the Americans and the
Chinese in the survey. China has the least
foreign travel.


Given the following box plot, answer the 
questions. 


[Figure: box plot on a number line with labels 0, 20, 100, and 150]


1. Think of an example (in words) where the 
data might fit into the above box plot. In 
2-5 sentences, write down the example. 


2. What does it mean to have the first and 
second quartiles so close together, while 
the second to third quartiles are far apart? 


1. Answers will vary. Possible answer: State 
University conducted a survey to see how 
involved its students are in community 
service. The box plot shows the number of 
community service hours logged by 
participants over the past year. 

2. Because the first and second quartiles are 
close, the data in this quarter is very 
similar. There is not much variation in the 
values. The data in the third quarter is 
much more variable, or spread out. This is 
clear because the second quartile is so far 
away from the third quartile. 


A survey was conducted of 130 purchasers of 
new BMW 3 series cars, 130 purchasers of new 
BMW 5 series cars, and 130 purchasers of new 
BMW 7 series cars. In it, people were asked the 
age they were when they purchased their car. 
The following box plots display the results. 


[Figure: three box plots, labeled BMW 3 series, BMW 5 series, and BMW 7 series]


1. In complete sentences, describe what the
shape of each box plot implies about the
distribution of the data collected for that
car series.

2. Which group is most likely to have an
outlier? Explain how you determined that.

3. Compare the three box plots. What do they
imply about the age of purchasing a BMW
from the series when compared to each
other?

4. Look at the BMW 5 series. Which quarter
has the smallest spread of data? What is
the spread?

5. Look at the BMW 5 series. Which quarter
has the largest spread of data? What is the
spread?

6. Look at the BMW 5 series. Estimate the
interquartile range (IQR).

7. Look at the BMW 5 series. Are there more
data in the interval 31 to 38 or in the
interval 45 to 55? How do you know this?

8. Look at the BMW 5 series. Which interval
has the fewest data in it? How do you
know this?


1. Each box plot is spread out more in the
greater values. Each plot is skewed to the
right, so the ages of the top 50% of buyers
are more variable than the ages of the
lower 50%.

2. The BMW 3 series is most likely to have an
outlier. It has the longest whisker.

3. Comparing the median ages, younger
people tend to buy the BMW 3 series,
while older people tend to buy the BMW 7
series. However, this is not a rule, because
there is so much variability in each data
set.

4. The second quarter has the smallest
spread. There seems to be only a three-year
difference between the first quartile
and the median.

5. The third quarter has the largest spread.
There seems to be approximately a 14-year
difference between the median and the
third quartile.

6. IQR ~ 17 years

7. There is not enough information to tell.
Each interval lies within a quarter, so we
cannot tell exactly where the data in that
quarter is concentrated.

8. The interval from 31 to 35 years has the
fewest data values. Twenty-five percent of
the values fall in the interval 38 to 41, and
25% fall between 41 and 64. Since 25% of
values fall between 31 and 38, we know
that fewer than 25% fall between 31 and
35.


The following data represents the number of 
passengers per flight on the AirBus from 
Calgary to Edmonton for 24 flights. 


8, 19, 22, 23, 29, 30, 34, 35, 37, 39, 41, 44, 44, 
46, 46, 47, 48, 49, 50, 52, 54, 55, 61, 65 


1. Generate the boxplot for this data. 

2. Identify the outliers in the data. Are they
low or high outliers? Are they extreme or
mild outliers?

3. Interpret the outliers in the context of the 
question. 

4. What is the IQR? Interpret it in the context 
of the question. 

5. Which quarter of the data is the most 
concentrated? The least concentrated? 

6. What is the five-number summary 
(minimum, first quartile, median, third 
quartile, maximum)? 


1. [Figure: box plot of the number of passengers on the AirBus, on a number line from 0 to 70]

2. There is one mild low outlier of 8
passengers on a flight.

3. The outlier means that on this flight
there were significantly fewer passengers
(only 8) than there are on other similar
flights.

4. The IQR is 16.25 (from 33 to 49.25). This
means that 50% of the time, the number of
passengers is between 33 and 49.25 on the
Airbus. This gives us a sense of the amount
of variation in the number of passengers.

5. The distance between the median and the
third quartile (from 44 to 49.25) is the
least (5.25). This means that these 25% of
data values are closely packed together.
The distance between the outlier and
the first quartile is the largest (25
passengers). This means that these 25% of
the data values are spread out from each
other.

6. The five-number summary is: Minimum
= 8; First quartile = 33; Median = 44;
Third quartile = 49.25; Maximum = 65.
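
The values in these answers can be reproduced in code. The following sketch makes two assumptions that the text does not state: the quartiles are interpolated with the "inclusive" method of Python's `statistics.quantiles` (which happens to give 33 and 49.25 for this data), and an outlier counts as extreme when it lies more than 3 IQRs beyond a quartile and as mild when it lies between 1.5 and 3 IQRs beyond it.

```python
from statistics import quantiles

# Passengers per flight for the 24 flights.
passengers = [8, 19, 22, 23, 29, 30, 34, 35, 37, 39, 41, 44,
              44, 46, 46, 47, 48, 49, 50, 52, 54, 55, 61, 65]

q1, med, q3 = quantiles(passengers, n=4, method="inclusive")
summary = (min(passengers), q1, med, q3, max(passengers))

# Width of each quarter: a narrow quarter means the values are packed together.
quarter_widths = [hi - lo for lo, hi in zip(summary, summary[1:])]

iqr = q3 - q1
mild_low, extreme_low = q1 - 1.5 * iqr, q1 - 3 * iqr
mild_outliers = [x for x in passengers if extreme_low <= x < mild_low]
extreme_outliers = [x for x in passengers if x < extreme_low]

print("five-number summary:", summary)
print("quarter widths:", quarter_widths)
print("mild low outliers:", mild_outliers, "extreme low outliers:", extreme_outliers)
```

Only the low side is checked here because the largest value, 65, is well below Q3 + 1.5 IQR.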


Bringing It Together 


Santa Clara County, CA, has approximately 
27,873 Japanese-Americans. Their ages are as 
follows: 


Age Group    Percent of Community
0-17         18.9
18-24        8.0
25-34        22.8
35-44        15.0
45-54        13.1
55-64        11.9
65+          10.3


1. Construct a histogram of the Japanese- 
American community in Santa Clara 
County, CA. The bars will not be the same 
width for this example. Why not? What 
impact does this have on the reliability of 
the graph? 

2. What percentage of the community is 
under age 35? 

3. Which box plot most resembles the 
information above? 


1. This is technically not a histogram as the
bars aren't touching, but without the
original data this is the best that I could
come up with unless I drew it by hand!


[Figure: "Histogram" of ages of Japanese-Americans in
Santa Clara County, with bars for the age groups
0-17, 18-24, 25-34, 35-44, 45-54, and 55-64]


2. 49.7% of the community is under the age
of 35.

3. Based on the information in the table,
graph (a) most closely represents the data.

Glossary 


Box plot 
a graph that gives a quick picture of the 
middle 50% of the data 


First Quartile 
the value that is the median of the
lower half of the ordered data set


Frequency Polygon 
looks like a line graph but uses intervals to 
display ranges of large amounts of data 


Interval 
also called a class interval; an interval 
represents a range of data and is used when 
displaying large data sets 


Paired Data Set 
two data sets that have a one-to-one
relationship so that: 


* both data sets are the same size, and 

* each data point in one data set is 
matched with exactly one point from the 
other set. 


Skewed 
used to describe data that is not symmetrical;
when the right side of a graph looks “chopped
off” compared to the left side, we say it is
“skewed to the left.” When the left side of the
graph looks “chopped off” compared to the
right side, we say the data is “skewed to the
right.” Alternatively: when the lower values
of the data are more spread out, we say the
data are skewed to the left. When the greater
values are more spread out, the data are
skewed to the right.


Practice Questions (Descriptive Statistics) - C 
Lemieux 
Practice Questions 


If a question has a set of data, please see the course 
site for the Excel file. 


The downturn in the oil and gas industry has 
ripple effects through multiple industries. 
Consumers are making hard choices when it 
comes to purchases. “Frivolous” expenses, like 
eating out, are some areas where consumers are 
choosing to reduce their spending. 


Duke’s is a family friendly eatery that also has a 
small bar attached. The restaurant can fit a 
maximum of 130 patrons at once and the bar 
can fit a maximum of 50 patrons. Based on 
anecdotal evidence, management at Duke’s 
believes that, on average, the number of 
customers coming to the restaurant has 
decreased since the downturn, but that the 
number of customers in the bar has stayed 
about the same. They think that customers
don’t want to spend big money on a meal, but
they still want to get together with friends for a
pint to watch the game. They have also noticed
that, though there are fewer customers in the
restaurant, those that do come in are, on
average, spending more per table than prior to
the downturn. They think that when customers
eat out, they are celebrating something so they
spend more freely, which has resulted in a
higher daily revenue. Further, the customers in
the bar are mostly buying alcoholic beverages
and are not getting food (just a pint, no
nachos). That is, there may be the same number
of customers in the bar but, on average, they
aren’t spending as much.


The managers at Duke’s want to determine if 
there is any truth to the anecdotal evidence. To 
do this, they want you to do a statistical 
analysis of the data they’ve collected. Based on 
data from prior to the economic downturn, on 
average, there were 207.6 patrons at the 
restaurant per day and 98.3 patrons at the bar 
each day. The average daily revenue for the
restaurant was $5,024.25, while for the bar it was
$1,796.40.


The managers collected data by putting all of
the nights they were open from January
2015 to September 2017 into a spreadsheet.
Then, from the top of the list, they chose every
21st date to get their random sample of 48
dates. Then for each date they chose, they
found the number of patrons in the bar and in
the restaurant for that date, and daily revenue
in the bar and in the restaurant for that date.
This is the data that is in the Excel sheet on BB.


1. 1. What sampling technique did the
managers use?

2. There are at least two things that the
managers did really wrong when they
used their sampling technique. State
two things wrong with their
technique. Explain why they are
wrong.


2. Investigate the managers’ belief about the 
number of patrons in the bar. To do this 
investigation, answer the following 
questions. 


1. There is one variable being studied
that is relevant to this question.
State the variable and which column
in the spreadsheet has the data.
Categorize the variable.

2. Use Excel to create a box plot for the
variable. Insert the box plot below.
Make sure to properly title and label
the box plot.

3. Are there any outliers? If so, what are
they?

4. If there are outliers, interpret the
outliers in the context of the question.
If there are not outliers, interpret
what that means in the context of the
question.

5. Look at the box plot for the bar. Does
it support or refute the managers’
belief about the number of patrons?
Support your answer. Your answer
needs to refer to appropriate measures
(note the plural: you have to refer to
more than one measure).


3. Investigate the managers’ belief about 
daily revenue in the restaurant. To do this 
investigation, answer the following 
questions. 


1. There is one variable being studied
that is relevant to this question.
State the variable and which column
in the spreadsheet has the data.
Categorize the variable.

2. Use Excel to create all of the
appropriate numerical summaries for
the variable. Insert the results below.

3. What is the best measure of centre?
Explain your reasoning. Make sure
you state the value of the measure
you have chosen.

4. What is the best measure of variation?
Explain your reasoning. Make sure
you state the value of the measure
you have chosen.

5. Look at your answers in parts 3 and 4.
Do they support or refute the
managers’ belief about the daily
revenue in the restaurant? Support
your answer. Your answer needs to
refer to appropriate measures (note
the plural: you have to refer to more
than one measure).


1. 1. Systematic “random” sampling

2. 1. The managers did not choose a
random starting point. Since they
have not done this, they have missed
the random part of the sampling
technique. That is, their data was not
collected randomly. 2. The managers
chose k to be 21, which is exactly
three weeks. This means that their
sample will only collect data from the
same day of the week. Thus their data is not
representative of the population.
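
The effect of choosing k = 21 can be seen by listing the sampled dates. This is a sketch under assumptions not stated in the question: the restaurant is taken to be open every night from 1 January 2015 to 30 September 2017, and the managers are taken to start with the first date in the list.

```python
from datetime import date, timedelta

# Assumed population: every calendar date in the study period.
start, end = date(2015, 1, 1), date(2017, 9, 30)
all_dates = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Systematic sample with k = 21 and a fixed (non-random) starting point.
k = 21
sample = all_dates[::k]

# 21 is a multiple of 7, so every sampled date falls on one weekday.
weekdays = {d.strftime("%A") for d in sample}
print(len(sample), "dates sampled, all on:", weekdays)
```

Under these assumptions the sample has 48 dates, matching the sample size in the question, and all of them share a single day of the week. A value of k that is not a multiple of 7, together with a randomly chosen starting point, would avoid both problems.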


2. Investigate the managers’ belief about the 
number of patrons in the bar. To do this 
investigation, answer the following 
questions. 


1. Variable: Number of patrons in the 
bar per day. In column D. Type of 
variable: Quantitative (discrete) 

2. Boxplot for Exercise 1

[Figure: box plot of the number of patrons in bar per day]

3. There are three outliers (55, 57 x 2).
They are all low, mild outliers.

4. The outliers represent dates where 
there were very few patrons in the bar 
compared to the rest of the sample. 
This means business was slow on 
these dates. 

5. The managers’ claim is that the
number of patrons in the bar has
stayed about the same since the
downturn, on average. Prior to the
downturn, the number of patrons in
the bar on average was 98.3. Based on
the box plot, we can see that the
median number of patrons after the
downturn is 88.50 and the
interquartile range is 13.25. This
means that 50% of the data values are
between 81.75 patrons per day and 95
patrons per day. Thus, 75% of the
data values are below 95 patrons per
day. Further, there are three low
outliers. Based on this, it appears that
the number of patrons per day in the
bar has actually decreased. Thus, the
data refutes the managers’ belief
about the number of patrons.


3. Investigate the managers’ belief about 
daily revenue in the restaurant. To do this 
investigation, answer the following 
questions. 


1. Variable: Daily revenue in the 
restaurant per day. In column B. Type 
of variable: Quantitative (continuous 
— but could also be considered 
discrete as you count revenue) 

2. Descriptive statistics for Exercise 1 (daily revenue)

count                            48
mean                             5,171.19
sample variance                  210,213.52
sample standard deviation        458.49
minimum                          4516
maximum                          5961
range                            1445
skewness                         0.21
kurtosis                         -1.32
coefficient of variation (CV)    8.87%
1st quartile                     4,783.00
median                           5,077.50
3rd quartile                     5,590.00
interquartile range              807.00
mode                             4,963.00

low extremes                     0
low outliers                     0
high outliers                    0
high extremes                    0


3. The best measure of centre for this
data is the mean as there are no
outliers and the mean (5171.19) and
the median (5077.50) are close to
each other (less than $100 difference
for data that is over $5000 means a
difference of less than 2%). Mean =
$5,171.19

4. The best measure of variation for this
data is the standard deviation. We
are not comparing data sets so the
coefficient of variation is not
necessary and though the IQR could
work, it is not well-known. Therefore
the standard deviation is the best bet.
Standard deviation: $458.49

5. The managers’ claim is that the
restaurant revenue has increased since
the downturn. The average prior to
the downturn was $5,024.25. The
mean for this data set is $5,171.19,
which suggests that the average has
increased, but we need to take into
account variation. The typical range
for this data set is $4,712.70 to
$5,629.68 (found by taking the mean
+/- standard deviation:
5171.19 - 458.49, 5171.19 + 458.49).
Once we take variation into account
we can see that the pre-downturn average
falls in the interval (i.e. 5024.25 is
between 4712.70 and 5629.68). This
means that it is unclear if the average
daily revenue has changed since the
downturn. This means that the data
neither supports nor refutes the
managers’ belief, but it does appear to
suggest that the average has not
increased.
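
The arithmetic in this answer can be checked directly from the descriptive statistics. The sketch below uses the mean and sample standard deviation reported in the table; the variable names are illustrative, and the "typical range" is simply the mean plus or minus one standard deviation, as described in the answer.

```python
# Values from the descriptive statistics table and the question.
mean_revenue = 5171.19       # sample mean of daily restaurant revenue
std_dev = 458.49             # sample standard deviation
prior_average = 5024.25      # average daily revenue before the downturn

# Typical range: one standard deviation either side of the mean.
low, high = mean_revenue - std_dev, mean_revenue + std_dev

# Coefficient of variation: standard deviation as a percentage of the mean.
cv_percent = std_dev / mean_revenue * 100

inside = low <= prior_average <= high
print(f"typical range: {low:.2f} to {high:.2f}")
print(f"CV: {cv_percent:.2f}%")
print("pre-downturn average inside typical range:", inside)
```

The pre-downturn average falls inside the typical range, which is why the answer concludes that the data do not clearly show a change.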


Which sales method wins? 


“I’m telling you that if we reduce the price,
we'll sell more!” yelled Feras. 


“And I’m telling you that the loss in gross profit 
will outweigh the added sales!” exclaimed 
Sabrina. 


Cheryl sighed. These meetings always devolved 
this way. Every year they had the same debate 
when it came to introducing the holiday gift 
packages. Feras and Sabrina were always on 
opposite sides of how to price the Hot Trio Gift 
Basket (which included three beverage 
selections: Fine Grind Dark Coffee, French 
Vanilla Cappuccino and Creamy Hot Chocolate) 
at the beginning of the Christmas buying 
season. 


“OK. I’m tired of this debate. We are going to 
end this once and for all. This year we are 
going to implement both of your ideas. 
Whichever method demonstrates the best 
results will be the method we use next year. 
Understood?” Both Feras and Sabrina gleefully 
nodded. Each couldn’t wait to show the other 
one up. 


After much discussion, they decided on a study 
design. For their 120 coffee shops across 
Western Canada, 60 shops were chosen at 
random to introduce the gift basket at the 
regular price of $20. The other 60 shops would 
introduce the gift basket by offering a limited 
two-week introductory price that was 10-20% 
less than the regular price. (Each of these shops
chose a discount of between 10% and 20%.)
After the two weeks, they would then increase 
the price back to the regular selling price of 
$20.00. None of the 120 shops would advertise 
the release of the Hot Trio Gift Basket, but 
instead would display the gift basket in a 
predetermined way. Total sales of the gift 
baskets would be computed for each store for 
the first two weeks they were introduced (see 
attached Excel spreadsheet for the results). 
Cheryl decided to look both at total number of 
gift baskets sold (in columns B and G in Excel 
file) and gross profit (in columns D and I). The 
cost of the gift basket to produce is $10. 
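
To see what is at stake in the debate, it helps to work out the gross profit per basket at each price. The sketch below assumes gross profit is the number of baskets sold times the selling price minus the $10 production cost; the function names are hypothetical, and the spreadsheet may compute gross profit differently.

```python
import math

REGULAR_PRICE = 20.0   # regular selling price in dollars
UNIT_COST = 10.0       # cost to produce one gift basket in dollars

def gross_profit(units_sold, discount_pct=0):
    """Gross profit in dollars for a store selling at the given discount."""
    price = REGULAR_PRICE * (100 - discount_pct) / 100
    return units_sold * (price - UNIT_COST)

def units_to_match(target_profit, discount_pct):
    """Smallest whole number of discounted baskets earning at least target_profit."""
    margin = REGULAR_PRICE * (100 - discount_pct) / 100 - UNIT_COST
    return math.ceil(target_profit / margin)

# Hypothetical store: 300 baskets at the regular price.
target = gross_profit(300)
for pct in (10, 20):
    print(f"{pct}% discount: {units_to_match(target, pct)} baskets "
          f"needed to match ${target:,.2f}")
```

Because a 10% discount cuts the profit per basket from $10 to $8, and a 20% discount cuts it to $6, a discount store has to sell considerably more baskets just to break even with a regular-price store.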


1. What are the variables being studied? 
Categorize each of them. 

2. Analyze how the study was conducted. To 
do this, answer the following questions. 


1. Provide one good thing about how the
study was designed.
2. Provide one flaw with how the study
was designed.


3. Compare the total number of gift baskets 
sold per store for each type of price (i.e. 
the introductory discount vs. the regular 
price). To do this answer the following 
questions. 


1. Create a boxplot for the total number 
of gift baskets sold for each type of 
price. Insert both of them here. 

2. Are there any outliers? What are
they? If there are outliers, what do
they mean in the context of the story?
If not, what does that mean in the
context of the story?

3. Interpret both of the box plots. In 
particular, comment on where the box 
plots are centred and the variability in 
each data set. 

4. Based on the above, is there a type of 
price that results in a higher total 
number of gift baskets sold? Explain 
your answer. 


4. Compare the gross profit for each type of 
price (i.e. the introductory discount vs. the 
regular price). To do this answer the 
following questions. 


1. Determine the best measure of centre
to use for the gross profit for each
type of price. Provide this measure of
centre for each data set. Explain why
it is the best measure of centre for
these data sets.

2. What do the measures of centre tell 
you about the data sets in the context 
of the story? 

3. Determine the best measure of 
variation to use for gross profit for 
each type of price. Provide this 
measure of variation for each data set. 
Explain why it is the best measure of 
variation for these data sets. 

4. What do the measures of variation tell 
you about the data sets in the context 
of the story? 

5. Based on the above, is there a type of 
price that results in a higher gross 
profit? Explain your answer. 


5. Based on your answers in questions 3
and 4, which type of price is better?
Provide a reason for your choice. Your
explanation should be based in both
statistics and business.


1. Number of gift baskets sold — quantitative 
discrete; 
Gross profit — quantitative continuous 


2. 1. 1) Shops were chosen at random to
determine which price they would use to
introduce the gift basket. This ensures that
the sample is random.

2) The gift baskets were all the same
and they were presented in the stores
the same way. This ensures that the
customers were getting a consistent
framing of the product.

2. 1) By having the discount stores
choose their discount, there was no
consistency in the amount of the
discount; they ranged from 10% to
20%. This can cause potential
problems in determining which sales
method is better without doing a
more detailed analysis of how the
discounts did relative to each other.

2) Further, there was no pairing of
stores. That is, to properly compare
the methods it would be beneficial to pair
similar stores (i.e. similar sales,
customers, locations, etc.) to ensure
that we are comparing the sales fairly.
By not doing this, this study has not
limited the variables. If there is a
difference in the methods, it could be
explained by the stores having
different demographics. For example,
all of the non-discount stores could be
located in more affluent
neighbourhoods.

3) Another issue is the lack of
advertising. Though this may have
been done to avoid issues of two
stores close to each other not having
the same pricing, it does lead to issues
in the study. If people knew that the
sale was available at certain stores,
would they have considered driving to
the store for the discount? Thus the
lack of advertising results in the true
potential of the sale being diminished.


3. 1. Boxplot for Exercise 2

[Figure: box plot of volume sold (no discount)]

2. There is a mild outlier (207) for the
stores that had discounts. This means
that this store sold significantly fewer
gift baskets than the other stores of
this type.

There are no outliers for the stores
that had no discounts. This means
that none of the stores sold
significantly more or fewer gift baskets
than the other stores of this type.

3. The boxplot for the stores that had a
discount is centred at the median of
320 while the stores without a
discount are centred at 304. Thus, on
average, the discount stores sold 320
gift baskets in the first two weeks of
sales, while the non-discount stores
sold 304. The middle 50% of the
stores under the discount scheme sold
between 302 and 337 units, while the
middle 50% of the stores under the
regular price scheme sold between
292 and 318 units.

Further, we can see that more than
25% of the discount stores sold over
330 gift baskets, while the non-discount
stores had no stores that sold
over 330.

The IQR for the discount stores is
35.25 while for the non-discount
stores it is 26.25. This suggests that the
non-discount stores had a more
consistent volume of gift baskets sold
compared to the discount stores.

4. The discount stores appear to sell
more gift baskets. This is indicated both
by a larger median and by the fact that 25% of
their stores have sold more than 330
gift baskets while none of the non-discount
stores have achieved selling
more than 330 gift baskets. Though
the non-discount stores have more
consistent results in the volume sold,
that lower variation is centred around a
lower number of gift baskets sold.


. As there is a low mild outlier in the 
discount gross profit data, we need to 
compare the mean and median to see 
if the mean gross profit of the 
discount stores would be pulled down 
by the outlier. This could skew the 
results. The mean is $2193.90 and the 
median is $2199.50. As they are 
similar we could use either of them, 
but as the mean is almost $6 less, I 
will use the median. Though the 
regular priced stores do not have an 
outlier, we want to use the same 
measure of centre for both so that we 
are comparing like values (i.e. median 
to median). 

Median discount stores = $2,199.50 
Median non-discount stores = 
$3,040.00 

. The median for the non-discount 
stores is greater than the median of 
the discount stores. This suggests that, 
even though the discount stores sold 
more gift baskets, the gross profit for 
the non-discount stores is, on average, 
greater than the discount stores. 

. To compare the amount of variation, 
we want to use a relative measure as 
the means are not the same. Thus we 


want to use the coefficient of 
variation. 

Coefficient of variation discount stores 
= 8.44% 

Coefficient of variation non discount 
stores = 4.45% 

4. As the coefficient of variation is less
than 10% for each type of store, this
suggests that all of the stores had very
little variation amongst their gross
profits. The non-discount stores,
though, had the smaller
coefficient of variation, which means
they had less variation in their
gross profit. This means that these 60
stores had consistent gross profit.

5. For the gross profit, the non-discount 
stores did significantly better, on 
average, than the discount stores as 
determined by the median. 

Further the variation for the gross 
profit was lowest with the non- 
discount stores (4.45% vs. 8.44%). 
This suggests that not only do the 
non-discount stores provide the best 
profit margin, but they also are more 
consistent than the discount stores in 
achieving it. 


In conclusion, Feras was correct that the
discount stores would sell more, but
Sabrina was also correct that the advantage
of selling more would not outweigh the loss
in profit. Based on the assumption
that the company is interested in making a 
greater gross profit to further the dividends 
of their stockholders, the company is better 
off selling the gift baskets with no 
discount. This is supported by the median 
gross profit ($3,040.00) for the regular 
priced stores being greater than the 
median gross profit of the discount stores 
($2,199.50). Also, the gross profit of the 
regular priced stores is more consistent 
than the gross profit of the discount stores. 
Alternative conclusion: 

The above conclusion is reached based on 
the assumptions that the company is a 
publicly traded company whose goal is to 
raise dividends for their shareholders. 
Though this is a fair assumption, it should 
be noted that the discounted gift baskets 
could be a better option if the company is 
using it as a marketing ploy to get more 
customers to come into the store. If that is 
their goal, then the discounts are a better 
option. 


Amal shoved a piece of paper at Karim as he 
walked in the door. “Look at this. I went for my 


dentist check-up last week and they charged me 
$273.45.” 


“That’s ok. Right? Your insurance covers it.” 
Karim hedged as he shoved off his jacket. 


“You’d think so. But nooooo. They cover a 
maximum of $217.91. So, I owe the dentist 
money. It is such a scam.” Amal threw her 
hands up in frustration. 


“We can cover the ... $60 difference. It isn’t a 
big deal.” He soothed as he gave her a kiss. 


“We can cover it. But not every family can. 
People shouldn’t have to make choices between 
paying for food and rent or getting proper 
health care. We apparently live in the richest 
province in Canada yet we don’t even cover 
basic dental care. It’s disgusting.” 


Karim decided it wasn’t the right time to bring
up that the dental prices were probably high
because they lived in the richest province in
Canada. He waited as she took a few breaths.


Amal’s shoulders slumped. “I’m sorry. You 
probably don’t want to hear this as you walk in 
the door. But it makes me so angry. Look I 
called a bunch of our friends and I found out 
how much their dentists charged them for a 
basic check-up. You wouldn’t believe what I 


found out.” 


Amal pulled Karim over to her laptop where a 
spreadsheet was open with many figures and 
numbers. Beside the name of their friends was a 
dollar amount representing the amount of 
money each person had been charged by their 
dentist. (see the Excel sheet posted on BB) 


“Once I got the data, I worked out how much 
they would have to pay out of pocket. That’s 
this column.” Amal pointed to the third column.
“If they were charged under $217.91, I put 0 in
as they wouldn’t have to pay anything out of 
pocket. Then I did an analysis of how much 
people are being overcharged in Alberta. But 
then I thought this wasn’t enough. So I called 
some of our friends back home in Ontario. They 
get covered for a maximum of $219 by their 
insurance companies (thanks internet). I looked 
at how much they had to pay out of pocket.” 
Amal pointed to the fifth and sixth column on 
the Excel sheet as she was talking. “It’s crazy 
what I found.” 


1. What are the two variables being studied? 
Categorize each of them. 

2. Critique how Amal collected her data. To 
do this answer the following questions. 


1. What type of sampling technique did 


Amal use? 

2. Amal is trying to make an inference 
from her sample about the population 
of all Albertans. Is this appropriate? 
Explain. 


3. Create a boxplot for the amount people 
have to pay out of pocket in Alberta and 
another for what they have to pay out of 
pocket in Ontario using MegaStat. Insert 
them both here. 

4. Create the numerical descriptive statistics
using MegaStat for both data sets. Include
all appropriate numerical descriptive
statistics. Insert the results below.

5. Analyze the measures of centre. To do this 
answer the following questions. 


1. Are there any outliers? What are
they? If there are outliers, what do
they mean in the context of the story?
If not, what does that mean in the
context of the story? Make sure you
comment on both data sets.

2. Determine the best measure of centre 
to use for comparing the two data 
sets. Provide this measure of centre 
for each data set. Explain why it is the 
best measure of centre for these data 
sets. 

3. Based on the above, make some initial 


conclusions on whether there is a 
difference in how much people pay 
out of pocket to see a dentist for a 
basic check-up in Alberta vs. Ontario, 
on average. 


6. Analyze the measures of variation. To do 
this answer the following questions. 


1. Investigate the IQR: 


1. In general, what is the IQR? 

2. State the IQR for each data set. 

3. What does the IQR tell us about 
the data? 


2. Investigate the coefficient of 
variation: 


1. In general, what is the coefficient 
of variation? 

2. State the coefficient of variation 
for each data set. 

3. What does the coefficient of 
variation tell us about the data? 


3. The IQR and the coefficient of 
variation are telling us two different 
things about the data. Explain why 
this is occurring. 

4. In this situation, which is the best 
measure of variation to use when 


comparing the two data sets? Explain
your reasoning.


7. Taking into account both centre and 
variation, do Albertans pay more out of 
pocket, on average, than Ontarians for a 
basic check-up at the dentist? Explain your 
answer by referring to the results in both 
questions 5 and 6. 


1. Amount paid out of pocket for basic visit 
to dentist by Albertans — quantitative 
continuous 
Amount paid out of pocket for basic visit 
to dentist by Ontarians — quantitative 
continuous 


2. 1. Convenience sampling 

2. It would not be appropriate. For a
sample to be used to make an
inference, it needs to be a good
sample. This means it needs to be
random, representative, and large
enough. As she is only asking her
friends, not every member of the
population has an equal chance of
being chosen, so it is not a random
sample. Though she may have a diverse
group of friends, it is unlikely that
her friends form a representative
sample of the populations of each
province.


3. Boxplots for Exercise 3


Figure 1:
Boxplot of the amount Albertans pay out of pocket for a basic dentist
appointment (horizontal axis: pay out of pocket ($), 0 to 120)


Boxplot of the amount Ontarians pay out of pocket
(horizontal axis: pay out of pocket ($), 0 to 40)


4. Descriptive statistics for Exercise 3 (Alberta)

Table 1: Numerical descriptive statistics for Albertans

Statistic                        Pay out of pocket
count                            16
mean                             64.3963
sample variance                  1,062.2444
sample standard deviation        32.5921
minimum                          0
maximum                          109.75
range                            109.75
skewness                         -0.5304
kurtosis                         -0.1257
coefficient of variation (CV)    50.61%
1st quartile                     51.6175
median                           62.6650
3rd quartile                     90.4275
interquartile range              38.8100
mode                             #N/A
low extremes                     0
low outliers                     0
high outliers                    0
high extremes                    0


Descriptive statistics for Exercise 3 (Ontario)

Table 2: Numerical descriptive statistics for Ontarians

Statistic                        Pay out of pocket
count                            11
mean                             18.9809
sample variance                  205.8056
sample standard deviation        14.3459
minimum                          0
maximum                          37.94
range                            37.94
skewness                         -0.0588
kurtosis                         -1.6754
coefficient of variation (CV)    75.58%
1st quartile                     6.8950
median                           18.9500
3rd quartile                     31.0000
interquartile range              24.1050
mode                             #N/A
low extremes                     0
low outliers                     0
high outliers                    0
high extremes                    0


5. 1. Neither of the data sets has outliers. 
This means that none of Amal’s 
friends pay significantly more or less 
out of pocket than any of her other 
friends. 

2. As there are no outliers in either data
set, the mean cost out of pocket is
the best measure of centre to use, as
the mean is not being skewed by
outliers and the mean is a well-known
measure of centre.

Mean for Albertans = $64.40
Mean for Ontarians = $18.98


3. The Ontarians appear to be paying
significantly less than Albertans out of
pocket for basic visits to the dentist.
In particular, they are paying, on
average, $45.42 less (based on the
mean).


6. 1. 1. The IQR is the interquartile
range. It is the distance between
the first and third quartiles. On
the boxplot, it is the length of the
box. It is the range where the
middle 50% of the data values
fall.

2. IQR for Albertans= $38.81 
IQR for Ontarians= $24.11 

3. The cost out of pocket for 
Albertans is more volatile than 
for Ontarians, which suggests 
that Albertans are seeing greater 
differences in out of pocket 
prices than Ontarians (based on 
the IQR). 


2. 1. The coefficient of variation is a
measure of relative variation. It
is the ratio of the standard
deviation to the mean. Therefore,
it tells us how much variation
there is relative to the mean. It is
useful when comparing two or more
data sets that have different means,
as other measures of variation may be
inflated by larger data values in
the data set, while the coefficient
of variation rarely is.

2. Coefficient of variation for 
Albertans = 50.61% 

Coefficient of variation for 
Ontarians = 75.58% 

3. The cost out of pocket for 
Ontarians is more volatile than 
for Albertans, which suggests 
that Ontarians are seeing greater 
differences in out of pocket 
prices than Albertans (based on 
the coefficient of variation). 


3. The IQR is larger for Albertans
because the data values are larger.
Therefore, the IQR is showing an
inflated amount of variation for
Albertans (compared to Ontarians).
The coefficient of variation, on the
other hand, takes into account the
difference in the size of the data
values (as it is a relative measure).
This means it is not being inflated by
the size of the data values.

4. To compare the amount of variation, 
we want to use a relative measure as 
the means are not the same. Thus we 


want to use the coefficient of 
variation. 


7. On average, Albertans pay $64.40 out of
pocket for a basic dental visit compared to
$18.98 for Ontarians. There is more relative
variation for Ontarians than for Albertans,
which suggests that though Albertans pay
more, their costs are more consistent
(relative to the average). If we take
variation into account by finding the
typical range of values that people pay
out of pocket (the mean plus or minus one
standard deviation), we find that Albertans
in the sample typically paid $31.81 to
$96.99 out of pocket for a basic dental
checkup, while Ontarians in the sample
typically paid $4.63 to $33.33. As the
typical ranges of values overlap, we cannot
say for certain that Albertans pay more on
average than Ontarians.
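
The measures quoted in answers 6 and 7 can be rechecked from the summary values in Tables 1 and 2. The sketch below is only an illustration in Python (the case itself uses MegaStat). It assumes the "typical range" in answer 7 is the mean plus or minus one standard deviation, which is what the quoted dollar figures match. The names stats, describe, and results are made up for this example.

```python
# Summary values copied from Tables 1 and 2 (MegaStat output).
stats = {
    "Alberta": {"mean": 64.3963, "sd": 32.5921, "q1": 51.6175, "q3": 90.4275},
    "Ontario": {"mean": 18.9809, "sd": 14.3459, "q1": 6.8950, "q3": 31.0000},
}


def describe(s):
    """Return the IQR, the CV as a percent, and the mean +/- one SD range."""
    iqr = s["q3"] - s["q1"]           # length of the box on the boxplot
    cv = 100 * s["sd"] / s["mean"]    # standard deviation relative to the mean
    typical = (s["mean"] - s["sd"], s["mean"] + s["sd"])  # assumed definition
    return iqr, cv, typical


results = {name: describe(s) for name, s in stats.items()}

for name, (iqr, cv, (low, high)) in results.items():
    print(f"{name}: IQR = {iqr:.2f}, CV = {cv:.2f}%, "
          f"typical range = {low:.2f} to {high:.2f}")
```

Because the tables carry four decimal places while the written answers use values rounded to the cent, the last digit of a printed result may differ by a cent from the figures in the answers.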


Introduction -- Probability Topics -- MtRoyal - 
Version2016RevA 

Meteor showers are rare, but
the probability of them occurring can be calculated.
(credit: Navicore/flickr)


Chapter objective
By the end of this chapter, the student should be
able to:


* Understand and use the terminology of
probability.

* Determine whether two events are mutually
exclusive and whether two events are
independent.

* Calculate probabilities using the addition and
multiplication rules.

* Construct and interpret contingency tables and
tree diagrams.

* Understand the difference between likely and
unlikely events.


It is often necessary to "guess" about the outcome of 
an event in order to make a decision. Politicians 
study polls to guess their likelihood of winning an 
election. Teachers choose a particular course of 
study based on what they think students can 
comprehend. Doctors choose the treatments needed 
for various diseases based on their assessment of 
likely results. You may have visited a casino where 
people play games chosen because of the belief that 
the likelihood of winning is good. You may have 
chosen your course of study based on the probable 
availability of jobs. 


You have, more than likely, used probability. In 
fact, you probably have an intuitive sense of 
probability. Probability deals with the chance of an 
event occurring. Whenever you weigh the odds of 
whether or not to do your homework or to study for 
an exam, you are using probability. In this chapter, 
you will learn how to solve probability problems 
using a systematic approach. 


Terminology -- Probability Topics -- MtRoyal - 
Version2016RevA 


Probability is a measure that is associated with how 
certain we are of outcomes of a particular 
experiment or activity. An experiment is a planned 
operation carried out under controlled conditions. If 
the result is not predetermined, then the experiment 
is said to be a chance experiment. Flipping one fair 
coin twice is an example of an experiment. 


A result of an experiment is called an outcome. The 
sample space of an experiment is the set of all 
possible outcomes. Three ways to represent a 
sample space are: to list the possible outcomes, to 
create a tree diagram, or to create a Venn diagram. 
The uppercase letter S is used to denote the sample 
space. For example, if you flip one fair coin, S = 
{H, T} where H = heads and T = tails are the 
outcomes. 


An event is any combination of outcomes. Upper 
case letters like A and B represent events. For 
example, if the experiment is to flip one fair coin, 
event A might be getting at most one head. The 
probability of an event A is written P(A). 


The probability of any outcome is the long-term 
relative frequency of that outcome. Probabilities 
are between zero and one, inclusive (that is, zero 
and one and all numbers between these values). 


P(A) = 0 means the event A can never happen. P(A)
= 1 means the event A always happens. P(A) = 0.5
means that event A has a 50% chance of happening. 
For example, if you flip one fair coin repeatedly 
(from 20 to 2,000 to 20,000 times) the relative 
frequency of heads approaches 0.5 (the probability 
of heads). 


Equally likely means that each outcome of an 
experiment occurs with equal probability. For 
example, if you toss a fair, six-sided die, each face 
(1, 2, 3, 4, 5, or 6) is as likely to occur as any other 
face. If you toss a fair coin, a Head (H) and a Tail 
(T) are equally likely to occur. If you randomly 
guess the answer to a true/false question on an 
exam, you are equally likely to select a correct 
answer or an incorrect answer. 


To calculate the probability of an event A when 
all outcomes in the sample space are equally 
likely, count the number of outcomes for event A 
and divide by the total number of outcomes in the 
sample space. For example, if you toss a fair dime
and a fair nickel, the sample space is {HH, TH, HT,
TT} where T = tails and H = heads. The sample
space has four outcomes. Let A = getting exactly one
head. There are two outcomes that meet this condition
{HT, TH}, so P(A) = 2/4 = 0.5.
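
The counting rule for equally likely outcomes can be sketched in a few lines of Python. This is an illustration only: the variable names are made up here, and the first letter of each outcome is taken to be the dime and the second the nickel.

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing a dime and a nickel: every (dime, nickel) pair.
sample_space = ["".join(pair) for pair in product("HT", repeat=2)]

# Event A: exactly one head among the two coins.
event_a = [outcome for outcome in sample_space if outcome.count("H") == 1]

# With equally likely outcomes, P(A) = (outcomes in A) / (outcomes in S).
p_a = Fraction(len(event_a), len(sample_space))

print(sample_space)
print(event_a)
print(p_a, float(p_a))
```

Using Fraction keeps the answer exact, which makes it easy to compare with a hand calculation.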


Suppose you roll one fair six-sided die, with the
numbers {1, 2, 3, 4, 5, 6} on its faces. Let event E =
rolling a number that is at least five. There are two
outcomes {5, 6}. P(E) = 2/6. If you were to roll the
die only a few times, you would not be surprised if
your observed results did not match the probability.
If you were to roll the die a very large number of
times, you would expect that, overall, 2/6 of the rolls
would result in an outcome of "at least five". You
would not expect exactly 2/6. The long-term relative
frequency of obtaining this result would approach
the theoretical probability of 2/6 as the number of
repetitions grows larger and larger.


This important characteristic of probability 
experiments is known as the law of large numbers 
which states that as the number of repetitions of an 
experiment is increased, the relative frequency 
obtained in the experiment tends to become closer 
and closer to the theoretical probability. Even 
though the outcomes do not happen according to 
any set pattern or order, overall, the long-term 
observed relative frequency will approach the 
theoretical probability. (The word empirical is 
often used instead of the word observed.) 
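
A short simulation can make the law of large numbers concrete. The sketch below uses Python's pseudo-random number generator in place of a physical die, so it is an illustration under that assumption and not a real experiment. The function name relative_frequency and the seed value are arbitrary choices for this example.

```python
import random


def relative_frequency(n_rolls, seed=0):
    """Simulate n_rolls of a fair die; return the share of rolls that are >= 5."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) >= 5)
    return hits / n_rolls


# The theoretical probability of "at least five" is 2/6.
for n in (20, 2_000, 200_000):
    print(n, round(relative_frequency(n), 4))
```

For small numbers of rolls the observed relative frequency can sit well away from 2/6; as the number of rolls grows, it tends to settle near that value.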


It is important to realize that in many situations, the 
outcomes are not equally likely. A coin or die may 
be unfair, or biased. Two math professors in 
Europe had their statistics students test the Belgian 
one Euro coin and discovered that in 250 trials, a 
head was obtained 56% of the time and a tail was 
obtained 44% of the time. The data seem to show 


that the coin is not a fair coin; more repetitions 
would be helpful to draw a more accurate 
conclusion about such bias. Some dice may be 
biased. Look at the dice in a game you have at 
home; the spots on each face are usually small holes 
carved out and then painted to make the spots 
visible. Your dice may or may not be biased; it is 
possible that the outcomes may be affected by the 
slight weight differences due to the different 
numbers of holes in the faces. Gambling casinos 
make a lot of money depending on outcomes from 
rolling dice, so casino dice are made differently to 
eliminate bias. Casino dice have flat faces; the holes 
are completely filled with paint having the same 
density as the material that the dice are made out of 
so that each face is equally likely to occur. Later we 
will learn techniques to use to work with 
probabilities for events that are not equally likely. 


A key concept in probability is whether an event is 
likely or unlikely. A likely event is an event that 
has a good chance of happening, while an unlikely 
event is rare. For example, it is likely to snow in 
Calgary in the winter, but it is unlikely to snow in 
Calgary in the summer (it can happen, but it would 
be a rare or strange event). In general, in statistics, 
unlikely events usually have a probability of less 
than 1% of happening. Likely events usually have a 
probability of greater than 10% of happening. If the 
probability of the event is between 1% and 10%, it 


is up to the statistician or researcher to make a call 
to determine whether it is likely or unlikely. 


"OR" Event: 

An outcome is in the event A OR B if the outcome is
in A or is in B or is in both A and B. For example, let
A = {1, 2, 3, 4, 5} and B = {4, 5, 6, 7, 8}. A OR B
= {1, 2, 3, 4, 5, 6, 7, 8}. Notice that 4 and 5 are
NOT listed twice.


"AND" Event: 

An outcome is in the event A AND B if the outcome 
is in both A and B at the same time. For example, let 
A and B be {1, 2, 3, 4, 5} and {4, 5, 6, 7, 8}, 
respectively. Then A AND B = {4, 5}. 


The complement of event A is denoted A’ (read "A
prime"). A’ consists of all outcomes that are NOT in
A. Notice that P(A) + P(A’) = 1. For example, let S
= {1, 2, 3, 4, 5, 6} and let A = {1, 2, 3, 4}. Then, A’
= {5, 6}. P(A) = 4/6, P(A’) = 2/6, and P(A) + P(A’)
= 4/6 + 2/6 = 1.
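
The OR, AND, and complement examples map directly onto set operations. The following Python sketch repeats the three examples above. The variable names are chosen for this illustration, and the probabilities assume each outcome in S is equally likely.

```python
from fractions import Fraction

# "OR" and "AND" events from the examples above.
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}
a_or_b = A | B    # union: outcomes in A, in B, or in both (no duplicates)
a_and_b = A & B   # intersection: outcomes in both A and B

# Complement example: S = {1, ..., 6} and event A2 = {1, 2, 3, 4}.
S = {1, 2, 3, 4, 5, 6}
A2 = {1, 2, 3, 4}
A2_complement = S - A2   # every outcome of S that is NOT in A2

# Assuming equally likely outcomes, probability = count / size of S.
p_a2 = Fraction(len(A2), len(S))
p_a2_complement = Fraction(len(A2_complement), len(S))

print(sorted(a_or_b), sorted(a_and_b), sorted(A2_complement))
print(p_a2, p_a2_complement, p_a2 + p_a2_complement)
```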


The conditional probability of A given B is written
P(A|B). P(A|B) is the probability that event A will
occur given that the event B has already occurred. A
conditional reduces the sample space. We
calculate the probability of A from the reduced
sample space B. The formula to calculate P(A|B) is
P(A|B) = P(A AND B) / P(B), where P(B) is greater than
zero.


For example, suppose we toss one fair, six-sided die. 
The sample space S = {1, 2, 3, 4, 5, 6}. Let A = 
face is 2 or 3 and B = face is even (2, 4, 6). To 
calculate P(A|B), we count the number of outcomes 


2 or 3 in the sample space B = {2, 4, 6}. Then we 
divide that by the number of outcomes B (rather 
than S). 


We get the same result by using the formula. 
Remember that S has six outcomes. 


P(A|B) = P(A AND B) / P(B)
= [(the number of outcomes that are 2 or 3 and even in S) / 6]
/ [(the number of outcomes that are even in S) / 6]
= (1/6) / (3/6) = 1/3
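
Both routes to P(A|B) in the die example, counting inside the reduced sample space B and applying the formula over S, can be compared side by side. The Python sketch below is illustrative only; the helper prob is a name made up here, and it assumes equally likely outcomes.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space for one fair die
A = {2, 3}               # face is 2 or 3
B = {2, 4, 6}            # face is even


def prob(event, space):
    """Probability of event within space, assuming equally likely outcomes."""
    return Fraction(len(event & space), len(space))


# Route 1: count the outcomes of A inside the reduced sample space B.
p_by_counting = prob(A, B)

# Route 2: the formula P(A|B) = P(A AND B) / P(B), both computed over S.
p_by_formula = prob(A & B, S) / prob(B, S)

print(p_by_counting, p_by_formula)
```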


Odds 

The odds of an event present the probability as a
ratio of success to failure. This is common in various
gambling formats. Mathematically, the odds of an
event can be defined as:

P(A) / (1 - P(A))


where P(A) is the probability of success and
1 - P(A) is the probability of failure. Odds
are always quoted as "numerator to denominator,"
e.g. 2 to 1. Here the probability of winning is twice
that of losing; thus, the probability of winning is
2/3, or about 0.67. A probability of winning of 0.60
would generate odds in favor of winning of 3 to 2. While
the calculation of odds can be useful in gambling
venues in determining payoff amounts, it is not
helpful for understanding probability or statistical
theory.
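
Converting between a probability and odds is a one-line calculation each way. The sketch below illustrates it in Python with exact fractions. The function names odds_in_favor and probability_from_odds are invented for this example.

```python
from fractions import Fraction


def odds_in_favor(p):
    """Odds P(A) / (1 - P(A)) for a probability p with 0 <= p < 1."""
    p = Fraction(p)
    return p / (1 - p)


def probability_from_odds(numerator, denominator):
    """Probability of success for odds quoted as 'numerator to denominator'."""
    return Fraction(numerator, numerator + denominator)


# A probability of winning of 0.60, written exactly as 3/5.
odds = odds_in_favor(Fraction(3, 5))
print(f"{odds.numerator} to {odds.denominator}")

# Odds of 2 to 1 in favor of winning.
p_win = probability_from_odds(2, 1)
print(p_win, round(float(p_win), 2))
```

Passing the probability as a Fraction rather than a decimal avoids floating-point rounding, so the odds come out as a clean ratio.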


Understanding Terminology and Symbols 


It is important to read each problem carefully to 
think about and understand what the events are. 
Understanding the wording is the first very 
important step in solving probability problems. 
Reread the problem several times if necessary. 
Clearly identify the event of interest. Determine 
whether there is a condition stated in the wording 
that would indicate that the probability is 
conditional; carefully identify the condition, if any. 


If the sample space S contains the events A and B,
then P(A|B) is found by looking only at the outcomes
that involve B, and within B looking at the portion
that also involves A. That portion is the
intersection of A and B.


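The sample space S is the whole numbers
starting at one and less than 20.

1. S = ___________

Let event A = the even numbers and
event B = numbers greater than 13.

2. A = ___________, B = ___________

3. P(A) = ___________, P(B) = ___________

4. A AND B = ___________, A OR B = ___________

5. P(A AND B) = ___________, P(A OR B) = ___________

6. A’ = ___________, P(A’) = ___________

7. P(A) + P(A’) = ___________

8. P(A|B) = ___________, P(B|A) = ___________;
are the probabilities equal?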


1. S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19}

2. A = {2, 4, 6, 8, 10, 12, 14, 16, 18}, B =
{14, 15, 16, 17, 18, 19}

3. P(A) = 9/19, P(B) = 6/19

4. A AND B = {14, 16, 18}, A OR B = {2, 4,
6, 8, 10, 12, 14, 15, 16, 17, 18, 19}

5. P(A AND B) = 3/19, P(A OR B) = 12/19

6. A’ = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19};
P(A’) = 10/19

7. P(A) + P(A’) = 1 (9/19 + 10/19 = 1)

8. P(A|B) = P(A AND B) / P(B) = 3/6, P(B|A)
= P(A AND B) / P(A) = 3/9, No

Try It 


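The sample space S is the ordered pairs of two
whole numbers, the first from one to three and
the second from one to four (Example: (1, 4)).


1. S = ___________


Let event A = the sum is even and event
B = the first number is prime.

2. A = ___________, B = ___________

3. P(A) = ___________, P(B) = ___________

4. A AND B = ___________, A OR B = ___________

5. P(A AND B) = ___________, P(A OR B) = ___________

6. B’ = ___________, P(B’) = ___________

7. P(A) + P(A’) = ___________

8. P(A|B) = ___________, P(B|A) = ___________;
are the probabilities equal?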


1. S = {(1,1), (1,2), (1,3), (1,4), (2,1), (2,2),
(2,3), (2,4), (3,1), (3,2), (3,3), (3,4)}

2. A = {(1,1), (1,3), (2,2), (2,4), (3,1),
(3,3)}

B = {(2,1), (2,2), (2,3), (2,4), (3,1), (3,2),
(3,3), (3,4)}

3. P(A) = 1/2, P(B) = 2/3

4. A AND B = {(2,2), (2,4), (3,1), (3,3)}

A OR B = {(1,1), (1,3), (2,1), (2,2), (2,3),
(2,4), (3,1), (3,2), (3,3), (3,4)}

5. P(A AND B) = 1/3, P(A OR B) = 5/6

6. B’ = {(1,1), (1,2), (1,3), (1,4)}, P(B’) =
1/3

7. P(B) + P(B’) = 1

8. P(A|B) = P(A AND B) / P(B) = 1/2, P(B|A)
= P(A AND B) / P(A) = 2/3, No.


A fair, six-sided die is rolled. Describe the 
sample space S, identify each of the following 
events with a subset of S and compute its 
probability (an outcome is the number of dots 
that show up). 


1. Event T = the outcome is two. 

2. Event A = the outcome is an even 
number. 

3. Event B = the outcome is less than four. 

4. The complement of A. 

5. A GIVEN B 

6. B GIVEN A 

7. A AND B 


8. A OR B

9. A OR B’

10. Event N = the outcome is a prime
number.

11. Event J = the outcome is seven.


S = {1, 2, 3, 4, 5, 6}

1. T = {2}, P(T) = 1/6
2. A = {2, 4, 6}, P(A) = 1/2
3. B = {1, 2, 3}, P(B) = 1/2
4. A’ = {1, 3, 5}, P(A’) = 1/2
5. A|B = {2}, P(A|B) = 1/3
6. B|A = {2}, P(B|A) = 1/3
7. A AND B = {2}, P(A AND B) = 1/6
8. A OR B = {1, 2, 3, 4, 6}, P(A OR B) = 5/6
9. A OR B’ = {2, 4, 5, 6}, P(A OR B’) = 2/3
10. N = {2, 3, 5}, P(N) = 1/2
11. A six-sided die does not have seven dots.
P(7) = 0.


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ve 


[link] describes the distribution of a random 
sample S of 100 individuals, organized by gender 
and whether they are right- or left-handed. 
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Let’s denote the events M = the subject is 
male, F = the subject is female, R = the 
subject is right-handed, L = the subject is left- 
handed. Compute the following probabilities: 


1. P(M)
2. P(F)
3. P(R)
4. P(L)
5. P(M AND R)
6. P(F AND L)
7. P(M OR F)
8. P(M OR R)
9. P(F OR L)
10. P(M’)
11. P(R|M)
12. P(F|L)
13. P(L|F)


1. P(M) = 0.52
2. P(F) = 0.48
3. P(R) = 0.87
4. P(L) = 0.13
5. P(M AND R) = 0.43
6. P(F AND L) = 0.04
7. P(M OR F) = 1
8. P(M OR R) = 0.96
9. P(F OR L) = 0.57
10. P(M’) = 0.48
11. P(R|M) = 0.8269 (rounded to four
decimal places)
12. P(F|L) = 0.3077 (rounded to four decimal
places)
13. P(L|F) = 0.0833
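
These probabilities can be checked by counting cells of a two-way table. The table itself is not reproduced here, so the cell counts in the sketch below are reconstructed from the answers (for instance, P(M AND R) = 0.43 with 100 individuals implies 43 right-handed males) and should be treated as an assumption. The helper p and its predicate arguments are illustrative names.

```python
from fractions import Fraction

# Cell counts RECONSTRUCTED from the answers above, not copied from the table:
# (gender, handedness) -> number of individuals, 100 in total.
counts = {("M", "R"): 43, ("M", "L"): 9, ("F", "R"): 44, ("F", "L"): 4}
total = sum(counts.values())


def p(predicate):
    """Probability that a randomly chosen individual satisfies the predicate."""
    matching = sum(c for cell, c in counts.items() if predicate(cell))
    return Fraction(matching, total)


p_m = p(lambda cell: cell[0] == "M")
p_l = p(lambda cell: cell[1] == "L")
p_m_or_r = p(lambda cell: cell[0] == "M" or cell[1] == "R")
p_f_or_l = p(lambda cell: cell[0] == "F" or cell[1] == "L")

# Conditional probabilities: P(X|Y) = P(X AND Y) / P(Y).
p_r_given_m = p(lambda cell: cell == ("M", "R")) / p_m
p_f_given_l = p(lambda cell: cell == ("F", "L")) / p_l

print(float(p_m), float(p_m_or_r), float(p_f_or_l))
print(round(float(p_r_given_m), 4), round(float(p_f_given_l), 4))
```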
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Chapter Review 


In this module we learned the basic terminology of 
probability. The set of all possible outcomes of an 
experiment is called the sample space. Events are 
subsets of the sample space, and they are assigned a 
probability that is a number between zero and one, 
inclusive. 


Formula Review 

A and B are events

P(S) = 1 where S is the sample space
0 ≤ P(A) ≤ 1


P(A|B) = P(A AND B) / P(B)


In a particular college class, there are male and
female students. Some students have long hair
and some students have short hair. Write the
symbols for the probabilities of the events for
parts 1 through 10. (Note that you cannot find
numerical answers here. You were not given
enough information to find any probability
values yet; concentrate on understanding the
symbols.)


* Let F be the event that a student is female.

* Let M be the event that a student is male.

* Let S be the event that a student has short
hair.

* Let L be the event that a student has long
hair.


1. The probability that a student does not
have long hair.

2. The probability that a student is male or
has short hair.

3. The probability that a student is a female
and has long hair.

4. The probability that a student is male,
given that the student has long hair.

5. The probability that a student has long
hair, given that the student is male.

6. Of all the female students, the probability
that a student has short hair.

7. Of all students with long hair, the
probability that a student is female.

8. The probability that a student is female or
has long hair.

9. The probability that a randomly selected
student is a male student with short hair.

10. The probability that a student is female.


1. P(L’) = P(S)
2. P(M OR S)
3. P(F AND L)
4. P(M|L)
5. P(L|M)
6. P(S|F)
7. P(F|L)
8. P(F OR L)
9. P(M AND S)
10. P(F)


Use the following information to answer the next four 
exercises. A box is filled with several party favors. It 
contains 12 hats, 15 noisemakers, ten finger traps, 
and five bags of confetti. 

Let H = the event of getting a hat. 

Let N = the event of getting a noisemaker. 

Let F = the event of getting a finger trap. 

Let C = the event of getting a bag of confetti. 


Find P(H).


Find P(N). 


P(N) = 15/42 = 5/14 = 0.36


Find P(F). 


Find P(C). 


P(C) = 5/42 = 0.12


Use the following information to answer the next six 
exercises. A jar of 150 jelly beans contains 22 red 
jelly beans, 38 yellow, 20 green, 28 purple, 26 blue, 
and the rest are orange. 


Let B = the event of getting a blue jelly bean 

Let G = the event of getting a green jelly bean. 
Let O = the event of getting an orange jelly bean. 
Let P = the event of getting a purple jelly bean. 
Let R = the event of getting a red jelly bean. 

Let Y = the event of getting a yellow jelly bean. 


Find P(B). 


Find P(G). 


P(G) = 20/150 = 2/15 = 0.13


Find P(P).


Find P(R). 


P(R) = 22/150 = 11/75 = 0.15


Find P(Y).


Find P(O). 


P(O) = (150 - 22 - 38 - 20 - 28 - 26) / 150 = 16/150 = 8/75 = 0.11


Use the following information to answer the next six 
exercises. There are 23 countries in North America, 
12 countries in South America, 47 countries in 
Europe, 44 countries in Asia, 54 countries in Africa, 
and 14 in Oceania (Pacific Ocean region). 

Let A = the event that a country is in Asia. 

Let E = the event that a country is in Europe. 

Let F = the event that a country is in Africa. 

Let N = the event that a country is in North 
America. 

Let O = the event that a country is in Oceania. 
Let S = the event that a country is in South 
America. 


Find P(A). 


Find P(E). 


P(E) = 47/194 = 0.24


Find P(F). 


Find P(N). 


P(N) = 23/194 = 0.12


Find P(O). 


Find P(S). 


P(S) = 12/194 = 6/97 = 0.06


What is the probability of drawing a red card in 
a standard deck of 52 cards? 


What is the probability of drawing a club in a 
standard deck of 52 cards? 


13/52 = 1/4 = 0.25


What is the probability of rolling an even 
number of dots with a fair, six-sided die 
numbered one through six? 


What is the probability of rolling a prime 
number of dots with a fair, six-sided die 
numbered one through six? 


Use the following information to answer the next two 
exercises. You see a game at a local fair. You have to 
throw a dart at a color wheel. Each section on the 
color wheel is equal in area. 


Let B = the event of landing on blue. 
Let R = the event of landing on red. 
Let G = the event of landing on green. 
Let Y = the event of landing on yellow. 


If you land on Y, you get the biggest prize. Find
P(Y).


If you land on red, you don’t get a prize. What 
is P(R)? 


P(R) = 4/8 = 0.5


Use the following information to answer the next ten 
exercises. On a baseball team, there are infielders 
and outfielders. Some players are great hitters, and 
some players are not great hitters. 

Let I = the event that a player is an infielder.

Let O = the event that a player is an outfielder. 

Let H = the event that a player is a great hitter. 

Let N = the event that a player is not a great hitter. 


Write the symbols for the probability that a 
player is not an outfielder. 


Write the symbols for the probability that a 
player is an outfielder or is a great hitter. 


P(O OR H) 


Write the symbols for the probability that a 
player is an infielder and is not a great hitter. 


Write the symbols for the probability that a 
player is a great hitter, given that the player is 
an infielder. 


P(H|I)


Write the symbols for the probability that a 
player is an infielder, given that the player is a 
great hitter. 


Write the symbols for the probability that of all 
the outfielders, a player is not a great hitter. 


P(N|O) 


Write the symbols for the probability that of all 
the great hitters, a player is an outfielder. 


Write the symbols for the probability that a 
player is an infielder or is not a great hitter. 


P(I OR N)


Write the symbols for the probability that a 
player is an outfielder and is a great hitter. 


Write the symbols for the probability that a 
player is an infielder. 


P(I)


What is the word for the set of all possible 
outcomes? 


What is conditional probability? 


The likelihood that an event will occur given 
that another event has already occurred. 


A shelf holds 12 books. Eight are fiction and the 
rest are nonfiction. Each is a different book 
with a unique title. The fiction books are 
numbered one to eight. The nonfiction books 
are numbered one to four. Randomly select one
book.

Let F = event that book is fiction 

Let N = event that book is nonfiction 

What is the sample space? 


What is the sum of the probabilities of an event 
and its complement? 


Use the following information to answer the next two 
exercises. You are rolling a fair, six-sided number 
cube. Let E = the event that it lands on an even 
number. Let M = the event that it lands on a
multiple of three. 


What does P(E|M) mean in words? 
What does P(E OR M) mean in words? 


the probability of landing on an even number 
or a multiple of three 


Homework 


[Figure: bar graph showing sample size, percent approve, and percent disapprove for each group: Total, 18-34, 35-44, 45-54, 55-64, 65+, Male, Female]


The graph in [link] displays the sample sizes 
and percentages of people in different age and 
gender groups who were polled concerning 
their approval of Mayor Ford’s actions in office. 
The total number in the sample of all the age 
groups is 1,045. 


1. Define three events in the graph.
2. Describe in words what the entry 40
means.
3. Describe in words the complement of the
entry in question 2.
4. Describe in words what the entry 30
means.
5. Out of the males and females, what
percent are males?
6. Out of the females, what percent
disapprove of Mayor Ford?
7. Out of all the age groups, what percent
approve of Mayor Ford?
8. Find P(Approve|Male).
9. Out of the age groups, what percent are
more than 44 years old?
10. Find P(Approve|Age < 35).


Explain what is wrong with the following 
statements. Use complete sentences. 


1. If there is a 60% chance of rain on
Saturday and a 70% chance of rain on
Sunday, then there is a 130% chance of
rain over the weekend.

2. The probability that a baseball player hits 
a home run is greater than the probability 
that he gets a successful hit. 


1. You can't calculate the joint probability 
knowing the probability of both events 
occurring, which is not in the information 
given; the probabilities should be 
multiplied, not added; and probability is 
never greater than 100% 

2. A home run by definition is a successful 
hit, so he has to have at least as many 
successful hits as home runs. 


Glossary 


Conditional Probability 
the likelihood that an event will occur given 
that another event has already occurred 


Equally Likely 
Each outcome of an experiment has the same 
probability. 


Event 
a subset of the set of all outcomes of an 
experiment; the set of all outcomes of an 


experiment is called a sample space and is 
usually denoted by S. An event is an arbitrary 
subset in S. It can contain one outcome, two 
outcomes, no outcomes (empty subset), the 
entire sample space, and the like. Standard 
notations for events are capital letters such as 
A, B, C, and so on. 


Experiment 
a planned activity carried out under 
controlled conditions 


Outcome 
a particular result of an experiment 


Probability 
a number between zero and one, inclusive, 
that gives the likelihood that a specific event 
will occur; the foundation of statistics is given 
by the following 3 axioms (by A.N. 
Kolmogorov, 1930’s): Let S denote the sample 
space and A and B are two events in S. Then: 


* 0 ≤ P(A) ≤ 1

* If A and B are any two mutually
exclusive events, then P(A OR B) = P(A)
+ P(B).

* P(S) = 1


Sample Space 
the set of all possible outcomes of an 
experiment 


The Intersection: the AND Event 
An outcome is in the event A AND B if the 
outcome is in both A AND B at the same time. 


The Complement Event 
The complement of event A consists of all 
outcomes that are NOT in A. 


The Conditional Probability of A GIVEN B 
P(A|B) is the probability that event A will 
occur given that the event B has already 
occurred. 


The Union: the OR Event 
An outcome is in the event A OR B if the 
outcome is in A or is in B or is in both A and 
B. 


Two Basic Rules of Probability 


When calculating probability, there are two rules to 
consider when determining if two events are 
independent or dependent and if they are mutually 
exclusive or not. 


The Multiplication Rule 


If A and B are two events defined on a sample
space, then: P(A ∩ B) = P(B)P(A|B). We can think
of the intersection symbol as substituting for the
word "and".


This rule may also be written as: P(A|B) = P(A ∩ B) / P(B)


This equation is read as the probability of A given B 
equals the probability of A and B divided by the 
probability of B. 


If A and B are independent, then P(A|B) =
P(A). Then P(A ∩ B) = P(A|B)P(B) becomes
P(A ∩ B) = P(A)P(B) because P(A|B) =
P(A) if A and B are independent.


One easy way to remember the multiplication rule is
that the word "and" means that the event has to
satisfy two conditions. For example, the name drawn
from the class roster is to be both a female and a
sophomore. It is harder to satisfy two conditions
than only one, and when we multiply two probabilities,
each of which is at most one, the result is never
larger than either of them. This reflects the
increasing difficulty of satisfying two conditions.
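
As a rough check of the multiplication rule, the sketch below enumerates the six outcomes of a fair die. The events A (even number) and B (number greater than three) are illustrative choices, not taken from the text, and the helper name prob is hypothetical.

```python
from fractions import Fraction

# Sample space of one roll of a fair six-sided die (illustrative assumption).
sample_space = {1, 2, 3, 4, 5, 6}
A = {x for x in sample_space if x % 2 == 0}   # even number: {2, 4, 6}
B = {x for x in sample_space if x > 3}        # greater than three: {4, 5, 6}

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

p_a_and_b = prob(A & B)                       # P(A ∩ B), counted directly
p_a_given_b = Fraction(len(A & B), len(B))    # P(A|B): sample space reduced to B

# Multiplication rule: P(A ∩ B) = P(B) P(A|B)
assert p_a_and_b == prob(B) * p_a_given_b
print(p_a_and_b, prob(A), prob(B))            # 1/3 1/2 1/2
```

In this example the joint probability 1/3 is smaller than both P(A) = 1/2 and P(B) = 1/2.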


The Addition Rule 


If A and B are defined on a sample space, then: P(A
∪ B) = P(A) + P(B) - P(A ∩ B). We can
think of the union symbol substituting for the word
"or". The reason we subtract the intersection of A
and B is to keep from double counting elements that
are in both A and B.


If A and B are mutually exclusive, then P(A ∩ B)
= 0. Then P(A ∪ B) = P(A) + P(B) - P(A ∩
B) becomes P(A ∪ B) = P(A) + P(B).


Klaus is trying to choose where to go on vacation.
His two choices are: A = New Zealand and B =
Alaska.


* Klaus can only afford one vacation. The
probability that he chooses A is P(A) = 0.6
and the probability that he chooses B is P(B)
= 0.35.

* P(A ∩ B) = 0 because Klaus can only afford
to take one vacation.

* Therefore, the probability that he chooses
either New Zealand or Alaska is P(A ∪ B) =
P(A) + P(B) = 0.6 + 0.35 = 0.95. Note
that the probability that he does not choose to
go anywhere on vacation must be 0.05.
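
The Klaus example can be reproduced with a short Python sketch. The function name addition_rule and the variable names are hypothetical; the probabilities are the ones given in the example.

```python
def addition_rule(p_a, p_b, p_a_and_b=0.0):
    """P(A ∪ B) = P(A) + P(B) - P(A ∩ B); the last term is 0 for mutually exclusive events."""
    return p_a + p_b - p_a_and_b

p_new_zealand = 0.6    # P(A)
p_alaska = 0.35        # P(B)

# The two vacations are mutually exclusive, so P(A ∩ B) = 0.
p_either = addition_rule(p_new_zealand, p_alaska)
p_neither = 1 - p_either                         # complement rule

print(round(p_either, 2), round(p_neither, 2))   # 0.95 0.05
```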


Carlos plays college soccer. He makes a goal 65% 
of the time he shoots. Carlos is going to attempt 
two goals in a row in the next game. A = the event 
Carlos is successful on his first attempt. P(A) = 
0.65. B = the event Carlos is successful on his 
second attempt. P(B) = 0.65. Carlos tends to shoot
in streaks. The probability that he makes the
second goal given that he made the first goal is 0.90.


a. What is the probability that he makes both 
goals? 


a. The problem is asking you to find P(A ∩ B)
= P(B ∩ A). Since P(B|A) = 0.90: P(B ∩
A) = P(B|A)P(A) = (0.90)(0.65) = 0.585


Carlos makes the first and second goals with 
probability 0.585. 


b. What is the probability that Carlos makes 
either the first goal or the second goal? 
b. The problem is asking you to find P(A ∪ B).


P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 0.65 +
0.65 - 0.585 = 0.715


Carlos makes either the first goal or the second 
goal with probability 0.715. 


c. Are A and B independent? 


c. No, they are not, because P(B ∩ A) = 0.585.
P(B)P(A) = (0.65)(0.65) = 0.423
0.423 ≠ 0.585 = P(B ∩ A)


So, P(B ∩ A) is not equal to P(B)P(A).


d. Are A and B mutually exclusive? 


d. No, they are not because P(A ∩ B) = 0.585.


To be mutually exclusive, P(A ∩ B) must equal
zero.
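
The four parts of the Carlos example can be checked numerically. This is a minimal sketch using the probabilities stated in the example; the variable names are assumptions made for illustration.

```python
import math

p_a = 0.65          # P(A): Carlos makes the first goal
p_b = 0.65          # P(B): Carlos makes the second goal
p_b_given_a = 0.90  # P(B|A)

p_both = p_b_given_a * p_a        # multiplication rule: P(A ∩ B)
p_either = p_a + p_b - p_both     # addition rule: P(A ∪ B)

# Independence would require P(A ∩ B) = P(A)P(B).
independent = math.isclose(p_both, p_a * p_b)
# Mutual exclusivity would require P(A ∩ B) = 0.
mutually_exclusive = math.isclose(p_both, 0.0, abs_tol=1e-12)

print(round(p_both, 3), round(p_either, 3), independent, mutually_exclusive)
# 0.585 0.715 False False
```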


Try It 


Helen plays basketball. For free throws, she 
makes the shot 75% of the time. Helen must 
now attempt two free throws. C = the event 
that Helen makes the first shot. P(C) = 0.75. D 
= the event Helen makes the second shot. 
P(D) = 0.75. The probability that Helen makes 
the second free throw given that she made the 
first is 0.85. What is the probability that Helen 
makes both free throws? 


P(D|C) = 0.85


P(C ∩ D) = P(D ∩ C)

P(D ∩ C) = P(D|C)P(C) = (0.85)(0.75) =
0.6375

Helen makes the first and second free throws 
with probability 0.6375. 


A community swim team has 150 members.
Seventy-five of the members are advanced
swimmers. Forty-seven of the members are
intermediate swimmers. The remainder are novice 
swimmers. Forty of the advanced swimmers 
practice four times a week. Thirty of the 
intermediate swimmers practice four times a week. 
Ten of the novice swimmers practice four times a 
week. Suppose one member of the swim team is 
chosen randomly. 


a. What is the probability that the member is a 
novice swimmer? 


a. 28/150


b. What is the probability that the member 
practices four times a week? 


b. 80/150


c. What is the probability that the member is 
an advanced swimmer and practices four times 
a week? 


c. 40/150


d. What is the probability that a member is an 
advanced swimmer and an intermediate 
swimmer? Are being an advanced swimmer 
and an intermediate swimmer mutually 
exclusive? Why or why not? 


d. P(advanced ∩ intermediate) = 0, so these
are mutually exclusive events. A swimmer 
cannot be an advanced swimmer and an 
intermediate swimmer at the same time. 


e. Are being a novice swimmer and practicing 
four times a week independent events? Why or 
why not? 


e. No, these are not independent events.
P(novice ∩ practices four times per week) =
0.0667

P(novice)P(practices four times per week) =
0.0996

0.0667 ≠ 0.0996
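
The independence check in part e can be sketched with exact fractions. The counts come from the swim team description; the variable names are hypothetical.

```python
from fractions import Fraction

total = 150
novice = total - 75 - 47        # 28 novice swimmers
practice_four = 40 + 30 + 10    # 80 members practice four times a week
novice_and_four = 10            # novices who practice four times a week

p_joint = Fraction(novice_and_four, total)                             # P(novice ∩ four)
p_product = Fraction(novice, total) * Fraction(practice_four, total)   # P(novice)P(four)

print(round(float(p_joint), 4), round(float(p_product), 4))   # 0.0667 0.0996
print("independent:", p_joint == p_product)                   # independent: False
```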


Try It 


A school has 200 seniors of whom 140 will be 
going to college next year. Forty will be going 
directly to work. The remainder are taking a 
gap year. Fifty of the seniors going to college
play sports. Thirty of the seniors going directly
to work play sports. Five of the seniors taking 
a gap year play sports. What is the probability 
that a senior is taking a gap year? 


P = (200 - 140 - 40)/200 = 20/200 = 0.1


Felicity attends Modesto JC in Modesto, CA. The
probability that Felicity enrolls in a math class is
0.2 and the probability that she enrolls in a speech
class is 0.65. The probability that she enrolls in a
math class given that she enrolls in a speech class is 0.25.
Let: M = math class, S = speech class, M|S =
math given speech


1. What is the probability that Felicity
enrolls in math and speech?
Find P(M ∩ S) = P(M|S)P(S).

2. What is the probability that Felicity
enrolls in math or speech classes?
Find P(M ∪ S) = P(M) + P(S) - P(M ∩ S).


3. Are M and S independent? Is P(M|S) =
P(M)?

4. Are M and S mutually exclusive? Is P(M ∩
S) = 0?


1. 0.1625, 2. 0.6875, 3. No, 4. No


Try It 


A student goes to the library. Let events B = 
the student checks out a book and D = the
student checks out a DVD. Suppose that P(B) =
0.40, P(D) = 0.30 and P(D|B) = 0.5. 


1. Find P(B ∩ D).
2. Find P(B ∪ D).


1. P(B ∩ D) = P(D|B)P(B) = (0.5)(0.4) =
0.20.

2. P(B ∪ D) = P(B) + P(D) - P(B ∩ D) =
0.40 + 0.30 - 0.20 = 0.50


Studies show that about one woman in seven
(approximately 14.3%) who live to be 90 will
develop breast cancer. Suppose that of those 
women who develop breast cancer, a test is 
negative 2% of the time. Also suppose that in the 
general population of women, the test for breast 
cancer is negative about 85% of the time. Let B = 
woman develops breast cancer and let N = tests 
negative. Suppose one woman is selected at 
random. 


a. What is the probability that the woman 
develops breast cancer? What is the 
probability that the woman tests negative?


a. P(B) = 0.143; P(N) = 0.85 


b. Given that the woman has breast cancer, 
what is the probability that she tests negative? 


b. P(N|B) = 0.02 


c. What is the probability that the woman has 
breast cancer AND tests negative? 


c. P(B ∩ N) = P(B)P(N|B) = (0.143)(0.02) =
0.0029


d. What is the probability that the woman has 
breast cancer or tests negative? 


d. P(B ∪ N) = P(B) + P(N) - P(B ∩ N) =
0.143 + 0.85 - 0.0029 = 0.9901


e. Are having breast cancer and testing 
negative independent events? 


e. No. P(N) = 0.85; P(N|B) = 0.02. So, P(N|B) 
does not equal P(N). 


f. Are having breast cancer and testing 
negative mutually exclusive? 


f. No. P(B ∩ N) = 0.0029. For B and N to be
mutually exclusive, P(B ∩ N) must be zero.
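
As a sketch of how parts c through f fit together, the following Python snippet recomputes them from the three given probabilities. The variable names are hypothetical, and the independence check uses a small tolerance because the inputs are floating-point numbers.

```python
p_b = 0.143          # P(B): develops breast cancer
p_n = 0.85           # P(N): tests negative
p_n_given_b = 0.02   # P(N|B): tests negative given breast cancer

p_b_and_n = p_b * p_n_given_b       # multiplication rule: P(B ∩ N)
p_b_or_n = p_b + p_n - p_b_and_n    # addition rule: P(B ∪ N)

independent = abs(p_n_given_b - p_n) < 1e-9   # would need P(N|B) = P(N)
mutually_exclusive = p_b_and_n == 0           # would need P(B ∩ N) = 0

print(round(p_b_and_n, 4), round(p_b_or_n, 4))   # 0.0029 0.9901
print(independent, mutually_exclusive)           # False False
```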


Try It 


A school has 200 seniors of whom 140 will be 
going to college next year. Forty will be going 
directly to work. The remainder are taking a 
gap year. Fifty of the seniors going to college 
play sports. Thirty of the seniors going directly 
to work play sports. Five of the seniors taking 
a gap year play sports. What is the probability 
that a senior is going to college and plays 
sports? 


Let A = student is a senior going to college. 
Let B = student plays sports. 

P(A) = 140/200

P(B|A) = 50/140

P(A ∩ B) = P(B|A)P(A)


P(A ∩ B) = (140/200)(50/140) = 1/4


Refer to the information in [link]. P = tests 
positive. 


1. Given that a woman develops breast
cancer, what is the probability that she
tests positive? Find P(P|B) = 1 - P(N|B).

2. What is the probability that a woman
develops breast cancer and tests positive?
Find P(B ∩ P) = P(P|B)P(B).

3. What is the probability that a woman
does not develop breast cancer? Find P(B’)
= 1 - P(B).

4. What is the probability that a woman
tests positive for breast cancer? Find P(P)
= 1 - P(N).


1. 0.98; 2. 0.1401; 3. 0.857; 4. 0.15


Try It 


A student goes to the library. Let events B = 
the student checks out a book and D = the 
student checks out a DVD. Suppose that P(B) 
= 0.40, P(D) = 0.30 and P(D|B) = 0.5. 


1. Find P(B’). 

2. Find P(D ∩ B).
3. Find P(B|D).

4. Find P(D ∩ B’).
5. Find P(D|B’). 


1. P(B’) = 0.60

2. P(D ∩ B) = P(D|B)P(B) = 0.20

3. P(B|D) = P(B ∩ D)/P(D) = (0.20)/(0.30)
≈ 0.67

4. P(D ∩ B’) = P(D) - P(D ∩ B) = 0.30 -
0.20 = 0.10
5. P(D|B’) = P(D ∩ B’)/P(B’) = (P(D) - P(D ∩
B))/(0.60) = (0.10)/(0.60) ≈ 0.17


Chapter Review


The multiplication rule and the addition rule are 
used for computing the probability of A and B, as 
well as the probability of A or B for two given 
events A, B defined on the sample space. In 
sampling with replacement each member of a 
population is replaced after it is picked, so that 
member has the possibility of being chosen more 
than once, and the events are considered to be 
independent. In sampling without replacement, each 
member of a population may be chosen only once, 
and the events are considered to be not 
independent. The events A and B are mutually 
exclusive events when they do not have any 
outcomes in common. 


Formula Review 
The multiplication rule: P(A ∩ B) = P(A|B)P(B)


The addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩
B)


Use the following information to answer the next ten
exercises. Forty-eight percent of all California
registered voters prefer life in prison without parole
over the death penalty for a person convicted of first
degree murder. Among Latino California registered
voters, 55% prefer life in prison without parole over
the death penalty for a person convicted of first
degree murder. 37.6% of all Californians are Latino.
In this problem, let:

* C = Californians (registered voters) preferring 
life in prison without parole over the death 
penalty for a person convicted of first degree 
murder. 

* L = Latino Californians


Suppose that one Californian is randomly selected. 


Find P(C). 


Find P(L). 


0.376 


Find P(C|L). 


In words, what is C|L? 


C|L means, given the person chosen is a Latino
Californian, the person is a registered voter who
prefers life in prison without parole for a person
convicted of first degree murder.


Find P(L ∩ C).


In words, what is L ∩ C?


L ∩ C is the event that the person chosen is a
Latino California registered voter who prefers
life without parole over the death penalty for a
person convicted of first degree murder.


Are L and C independent events? Show why or 
why not. 


Find P(L ∪ C).


0.6492


In words, what is L ∪ C?


Are L and C mutually exclusive events? Show 
why or why not. 


No, because P(L ∩ C) does not equal 0.


(credit: film8ker/wikibooks) 


Homework 


On February 28, 2013, a Field Poll Survey 
reported that 61% of California registered 
voters approved of allowing two people of the 
same gender to marry and have regular 
marriage laws apply to them. Among 18 to 39 
year olds (California registered voters), the 
approval rating was 78%. Six in ten California 
registered voters said that the upcoming 
Supreme Court’s ruling about the 
constitutionality of California’s Proposition 8 
was either very or somewhat important to 
them. Out of those CA registered voters who 
support same-sex marriage, 75% say the ruling 
is important to them. 


In this problem, let: 


* C = California registered voters who 
support same-sex marriage. 

* B = California registered voters who say
the Supreme Court’s ruling about the
constitutionality of California’s Proposition
8 is very or somewhat important to them

* A = California registered voters who are
18 to 39 years old.

1. Find P(C).

2. Find P(B).

3. Find P(C|A).

4. Find P(B|C).

5. In words, what is C|A?

6. In words, what is B|C?

7. Find P(C ∩ B).

8. In words, what is C ∩ B?

9. Find P(C ∪ B).

10. Are C and B mutually exclusive events?
Show why or why not.


After Rob Ford, the mayor of Toronto, 
announced his plans to cut budget costs in late 
2011, the Forum Research polled 1,046 people 
to measure the mayor’s popularity. Everyone 
polled expressed either approval or disapproval. 
These are the results their poll produced: 


In early 2011, 60 percent of the population 
approved of Mayor Ford’s actions in office. 
In mid-2011, 57 percent of the population 
approved of his actions. 

In late 2011, the percentage of popular 
approval was measured at 42 percent. 


1. What is the sample size for this study?
2. What proportion in the poll disapproved of
Mayor Ford, according to the results from
late 2011?


3. How many people polled responded that 
they approved of Mayor Ford in late 2011? 

4. What is the probability that a person 
supported Mayor Ford, based on the data 
collected in mid-2011? 

5. What is the probability that a person 
supported Mayor Ford, based on the data 
collected in early 2011? 


1. The Forum Research surveyed 1,046
Torontonians.
2. 58%
3. 42% of 1,046 = 439 (rounding to the
nearest integer)
4. 0.57
5. 0.60


Use the following information to answer the next three
exercises. The casino game, roulette, allows the 
gambler to bet on the probability of a ball, which 
spins in the roulette wheel, landing on a particular 
color, number, or range of numbers. The table used 
to place bets consists of 38 numbers, and each
number is assigned to a color and a range.

[Figure: roulette betting table; the visible labels include 1 to 18, EVEN, ODD, and 19 to 36]


1. List the sample space of the 38 possible 
outcomes in roulette. 

2. You bet on red. Find P(red). 

3. You bet on -1st 12- (1st Dozen). Find P(-1st 
12-). 

4. You bet on an even number. Find P(even 
number). 

5. Is getting an odd number the complement 
of getting an even number? Why? 

6. Find two mutually exclusive events. 

7. Are the events Even and 1st Dozen 
independent? 


Compute the probability of winning the 
following types of bets: 


1. Betting on two lines that touch each other 
on the table as in 1-2-3-4-5-6 


2. Betting on three numbers in a line, as in 
1-2-3 

3. Betting on one number 

4. Betting on four numbers that touch each 
other to form a square, as in 10-11-13-14 

5. Betting on two numbers that touch each 
other on the table, as in 10-11 or 10-13 

6. Betting on 0-00-1-2-3 

7. Betting on 0-1-2; or 0-00-2; or 00-2-3 


1. P(Betting on two lines that touch each other
on the table) = 6/38

2. P(Betting on three numbers in a line) =
3/38

3. P(Betting on one number) = 1/38

4. P(Betting on four numbers that touch each
other to form a square) = 4/38

5. P(Betting on two numbers that touch each
other on the table) = 2/38

6. P(Betting on 0-00-1-2-3) = 5/38

7. P(Betting on 0-1-2; or 0-00-2; or 00-2-3) =
3/38
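
Each answer above has the form (numbers covered by the bet)/38. A minimal Python sketch of that pattern follows, assuming all 38 numbers are equally likely; the dictionary keys are hypothetical labels for the bets described in the exercise.

```python
from fractions import Fraction

POCKETS = 38  # the numbers 1-36 plus 0 and 00, assumed equally likely

# How many numbers each bet covers.
bets = {
    "two adjacent lines (1-2-3-4-5-6)": 6,
    "three numbers in a line (1-2-3)": 3,
    "one number": 1,
    "four numbers in a square": 4,
    "two adjacent numbers": 2,
    "0-00-1-2-3": 5,
    "0-1-2, 0-00-2, or 00-2-3": 3,
}

win_probability = {name: Fraction(covered, POCKETS) for name, covered in bets.items()}

for name, p in win_probability.items():
    print(f"{name}: {p} (about {float(p):.4f})")
```

Note that Fraction reduces automatically, so 6/38 is displayed as 3/19.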


Compute the probability of winning the 
following types of bets: 


1. Betting on a color 
2. Betting on one of the dozen groups 
3. Betting on the range of numbers from 1 to 18
4. Betting on the range of numbers 19-36
5. Betting on one of the columns 
6. Betting on an even or odd number 
(excluding zero) 


Suppose that you have eight cards. Five are 
green and three are yellow. The five green 
cards are numbered 1, 2, 3, 4, and 5. The three 
yellow cards are numbered 1, 2, and 3. The 
cards are well shuffled. You randomly draw one 
card. 


* G = card drawn is green
* E = card drawn is even-numbered


1. List the sample space.

2. P(G) = __

3. P(G|E) = __

4. P(G ∩ E) = __

5. P(G ∪ E) = __

6. Are G and E mutually exclusive?
Justify your answer numerically.


1. {G1, G2, G3, G4, G5, Y1, Y2, Y3}
2. 5/8
3. 2/3
4. 2/8
5. 6/8


6. No, because P(G ∩ E) does not equal 0.
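
These answers can be verified by listing the eight cards and counting. The sketch below is one way to do that in Python; the helper prob and the tuple encoding of the cards are assumptions made for illustration.

```python
from fractions import Fraction

# Eight cards: green 1-5 and yellow 1-3, each equally likely to be drawn.
cards = [("G", n) for n in range(1, 6)] + [("Y", n) for n in range(1, 4)]

G = {c for c in cards if c[0] == "G"}      # card is green
E = {c for c in cards if c[1] % 2 == 0}    # card is even-numbered

def prob(event, given=None):
    """P(event), or P(event | given) when a conditioning event is supplied."""
    space = set(cards) if given is None else given
    return Fraction(len(event & space), len(space))

# In Python, G & E is the set intersection and G | E is the set union.
print(prob(G), prob(G, given=E), prob(G & E), prob(G | E))   # 5/8 2/3 1/4 3/4
```

The printed fractions are reduced forms of 5/8, 2/3, 2/8, and 6/8.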


Roll two fair dice separately. Each die has six 
faces. 


1. List the sample space. 

2. Let A be the event that either a three or 
four is rolled first, followed by an even 
number. Find P(A). 

3. Let B be the event that the sum of the two 
rolls is at most seven. Find P(B). 

4. In words, explain what “P(A|B)” 
represents. Find P(A|B). 

5. Are A and B mutually exclusive events? 
Explain your answer in one to three 
complete sentences, including numerical 
justification. 

6. Are A and B independent events? Explain 
your answer in one to three complete 
sentences, including numerical 
justification. 


A special deck of cards has ten cards. Four are 
green, three are blue, and three are red. When a 
card is picked, its color is recorded. An
experiment consists of first picking a card and 
then tossing a coin. 


1. List the sample space. 

2. Let A be the event that a blue card is 
picked first, followed by landing a head on 
the coin toss. Find P(A). 

3. Let B be the event that a red or green is 
picked, followed by landing a head on the 
coin toss. Are the events A and B mutually 
exclusive? Explain your answer in one to 
three complete sentences, including 
numerical justification. 

4. Let C be the event that a red or blue is 
picked, followed by landing a head on the 
coin toss. Are the events A and C mutually 
exclusive? Explain your answer in one to 
three complete sentences, including 
numerical justification. 


The coin toss is independent of the card picked
first.


1. {(G,H) (G,T) (B,H) (B,T) (R,H) (R,T)}

2. P(A) = P(blue)P(head) = (3/10)(1/2)
= 3/20

3. Yes, A and B are mutually exclusive
because they cannot happen at the same
time; you cannot pick a card that is both
blue and also (red or green). P(A ∩ B) = 0

4. No, A and C are not mutually exclusive
because they can occur at the same time.
In fact, C includes all of the outcomes of A;
if the card chosen is blue it is also (red or
blue). P(A ∩ C) = P(A) = 3/20


An experiment consists of first rolling a die and 
then tossing a coin. 


1. List the sample space. 

2. Let A be the event that either a three or a 
four is rolled first, followed by landing a 
head on the coin toss. Find P(A). 

3. Let B be the event that the first and second 
tosses land on heads. Are the events A and 
B mutually exclusive? Explain your answer 
in one to three complete sentences, 
including numerical justification. 


An experiment consists of tossing a nickel, a 
dime, and a quarter. Of interest is the side the 
coin lands on. 


1. List the sample space. 
2. Let A be the event that there are at least
two tails. Find P(A).

3. Let B be the event that the first and second 
tosses land on heads. Are the events A and 
B mutually exclusive? Explain your answer 
in one to three complete sentences, 
including justification. 


1. S = {(HHH), (HHT), (HTH), (HTT), (THH), 
(THT), (TTH), (TTT)} 

2. 4/8

3. Yes, because if B has occurred, it is
impossible to obtain two tails. In other
words, P(A ∩ B) = 0.
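
The sample space and both events can be enumerated directly. This is a minimal sketch that assumes fair coins and represents each outcome as a (nickel, dime, quarter) tuple; the names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Each outcome is (nickel, dime, quarter); the 8 outcomes are equally likely.
sample_space = list(product("HT", repeat=3))

A = {o for o in sample_space if o.count("T") >= 2}             # at least two tails
B = {o for o in sample_space if o[0] == "H" and o[1] == "H"}   # first two coins land heads

p_a = Fraction(len(A), len(sample_space))
p_a_and_b = Fraction(len(A & B), len(sample_space))

print(p_a, p_a_and_b)   # 1/2 0
```

Since no outcome belongs to both A and B, the two events are mutually exclusive.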


Consider the following scenario: 
Let P(C) = 0.4. 

Let P(D) = 0.5. 

Let P(C|D) = 0.6. 


1. Find P(C ∩ D).
2. Are C and D mutually exclusive? Why or
why not?
3. Are C and D independent events? Why or
why not?
4. Find P(C ∪ D).
5. Find P(D|C).


Y and Z are independent events.


1. Rewrite the basic Addition Rule P(Y ∪ Z)
= P(Y) + P(Z) - P(Y ∩ Z) using the
information that Y and Z are independent
events.

2. Use the rewritten rule to find P(Z) if P(Y ∪
Z) = 0.71 and P(Y) = 0.42.


1. If Y and Z are independent, then P(Y ∩ Z)
= P(Y)P(Z), so P(Y ∪ Z) = P(Y) + P(Z) -
P(Y)P(Z).

2. 0.5
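
Solving the rewritten rule for P(Z) gives P(Z) = (P(Y ∪ Z) - P(Y)) / (1 - P(Y)). The following Python sketch applies that rearrangement; the function name solve_p_z is a hypothetical helper, and the formula holds only under the stated independence assumption with P(Y) less than 1.

```python
def solve_p_z(p_y_or_z, p_y):
    """Solve P(Y ∪ Z) = P(Y) + P(Z) - P(Y)P(Z) for P(Z), assuming independence."""
    return (p_y_or_z - p_y) / (1 - p_y)

p_z = solve_p_z(0.71, 0.42)
print(round(p_z, 4))   # 0.5

# Substitute back into the addition rule as a check.
p_y = 0.42
assert abs((p_y + p_z - p_y * p_z) - 0.71) < 1e-9
```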


G and H are mutually exclusive events. P(G) =
0.5; P(H) = 0.3


1. Explain why the following statement 
MUST be false: P(H|G) = 0.4. 

2. Find P(H U G). 

3. Are G and H independent or dependent 
events? Explain in a complete sentence. 


Approximately 281,000,000 people over age
five live in the United States. Of these people,
55,000,000 speak a language other than English
at home. Of those who speak another language
at home, 62.3% speak Spanish.


Let: E = speaks English at home; E’ = speaks 
another language at home; S = speaks Spanish; 


Finish each probability statement by matching 
the correct answer. 


Probability Statements    Answers
a. P(E’) =                i. 0.8043
b. P(E) =                 ii. 0.623
c. P(S ∩ E’) =            iii. 0.1957
d. P(S|E’) =              iv. 0.1219


In 1994, the U.S. government held a lottery to
issue 55,000 Green Cards (permits for non- 
citizens to work legally in the U.S.). Renate 
Deutsch, from Germany, was one of 
approximately 6.5 million people who entered 
this lottery. Let G = won green card. 


1. What was Renate’s chance of winning a
Green Card? Write your answer as a
probability statement.

2. In the summer of 1994, Renate received a 
letter stating she was one of 110,000 
finalists chosen. Once the finalists were 
chosen, assuming that each finalist had an 
equal chance to win, what was Renate’s 
chance of winning a Green Card? Write 
your answer as a conditional probability 
statement. Let F = was a finalist. 

3. Are G and F independent or dependent 
events? Justify your answer numerically 
and also explain why. 

4. Are G and F mutually exclusive events? 
Justify your answer numerically and 
explain why. 


Three professors at George Washington 
University did an experiment to determine if 
economists are more selfish than other people. 
They dropped 64 stamped, addressed envelopes 
with $10 cash in different classrooms on the 
George Washington campus. 44% were 
returned overall. From the economics classes 
56% of the envelopes were returned. From the 
business, psychology, and history classes 31% 
were returned. 


Let: R = money returned; E = economics
classes; O = other classes


1. Write a probability statement for the
overall percent of money returned.

2. Write a probability statement for the
percent of money returned out of the
economics classes.

3. Write a probability statement for the
percent of money returned out of the other
classes.

4. Is money being returned independent of
the class? Justify your answer numerically
and explain it.

5. Based upon this study, do you think that
economists are more selfish than other
people? Explain why or why not. Include
numbers to justify your answer.


1. P(R) = 0.44

2. P(R|E) = 0.56

3. P(R|O) = 0.31

4. No, whether the money is returned is not
independent of which class the money was
placed in. There are several ways to justify
this mathematically, but one is that the
money placed in economics classes is not
returned at the same overall rate; P(R|E)
≠ P(R).

5. No, this study definitely does not support
that notion; in fact, it suggests the
opposite. The money placed in the
economics classrooms was returned at a
higher rate than the money placed in all
classes collectively; P(R|E) > P(R).


The following table of data obtained from
www.baseball-almanac.com shows hit
information for four players. Suppose that one
hit from the table is randomly selected.


Name             Single   Double   Triple   Home run   Total
Babe Ruth        1,517    506      136      714        2,873
Jackie Robinson  1,054    273      54       137        1,518
Ty Cobb          3,603    174      295      114        4,189
Hank Aaron       2,294    624      98       755        3,771
Total            8,471    1,577    583      1,720      12,351


Are "the hit being made by Hank Aaron" and 
"the hit being a double" independent events? 


1. Yes, because P(hit by Hank Aaron|hit is a
double) = P(hit by Hank Aaron)

2. No, because P(hit by Hank Aaron|hit is a
double) ≠ P(hit is a double)

3. No, because P(hit is by Hank Aaron|hit is a
double) ≠ P(hit by Hank Aaron)

4. Yes, because P(hit is by Hank Aaron|hit is
a double) = P(hit is a double)


United Blood Services is a blood bank that 
serves more than 500 hospitals in 18 states. 
According to their website, a person with type 
O blood and a negative Rh factor (Rh-) can 
donate blood to any person with any blood type.
Their data show that 43% of people have type 
O blood and 15% of people have Rh- factor; 
52% of people have type O or Rh- factor. 


1. Find the probability that a person has both
type O blood and the Rh- factor.

2. Find the probability that a person does
NOT have both type O blood and the Rh-
factor.


1. P(type O ∪ Rh-) = P(type O) + P(Rh-) -
P(type O ∩ Rh-)

0.52 = 0.43 + 0.15 - P(type O ∩ Rh-);
solve to find P(type O ∩ Rh-) = 0.06

6% of people have type O, Rh- blood


2. P(NOT(type O ∩ Rh-)) = 1 - P(type O ∩
Rh-) = 1 - 0.06 = 0.94

94% of people do not have type O, Rh-
blood
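
The blood type solution rearranges the addition rule to isolate the intersection. A short Python sketch of the same arithmetic is shown below; the variable names are hypothetical.

```python
p_type_o = 0.43        # P(type O)
p_rh_neg = 0.15        # P(Rh-)
p_o_or_rh_neg = 0.52   # P(type O ∪ Rh-)

# Rearranged addition rule: P(A ∩ B) = P(A) + P(B) - P(A ∪ B)
p_both = p_type_o + p_rh_neg - p_o_or_rh_neg
p_not_both = 1 - p_both    # complement rule

print(round(p_both, 2), round(p_not_both, 2))   # 0.06 0.94
```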


At a college, 72% of courses have final exams 
and 46% of courses require research papers. 
Suppose that 32% of courses have a research 
paper and a final exam. Let F be the event that 
a course has a final exam. Let R be the event 
that a course requires a research paper. 


1. Find the probability that a course has a 
final exam or a research project. 

2. Find the probability that a course has 
NEITHER of these two requirements. 


In a box of assorted cookies, 36% contain 
chocolate and 12% contain nuts. Of those, 8% 
contain both chocolate and nuts. Sean is 
allergic to both chocolate and nuts. 


1. Find the probability that a cookie contains 
chocolate or nuts (he can't eat it). 

2. Find the probability that a cookie does not 
contain chocolate or nuts (he can eat it). 


Let C = the event that the cookie
contains chocolate. Let N = the event that
the cookie contains nuts.

1. P(C ∪ N) = P(C) + P(N) - P(C ∩ N) =
0.36 + 0.12 - 0.08 = 0.40

2. P(NEITHER chocolate NOR nuts) = 1 - P(C
∪ N) = 1 - 0.40 = 0.60


A college finds that 10% of students have taken 
a distance learning class and that 40% of 
students are part time students. Of the part time 
students, 20% have taken a distance learning 
class. Let D = event that a student takes a 
distance learning class and E = event that a 
student is a part time student.


1. Find P(D ∩ E).

2. Find P(E|D).

3. Find P(D ∪ E).

4. Using an appropriate test, show whether D
and E are independent.

5. Using an appropriate test, show whether D
and E are mutually exclusive.


Glossary


Independent Events 
The occurrence of one event has no effect on
the probability of the occurrence of another
event. Events A and B are independent if one
of the following is true:


1. P(A|B) = P(A)
2. P(B|A) = P(B)
3. P(A ∩ B) = P(A)P(B)


Mutually Exclusive 
Two events are mutually exclusive if the 
probability that they both happen at the same 
time is zero. If events A and B are mutually 
exclusive, then P(A M B) = 0. 


Contingency Tables and Tree Diagrams (optional) 


Contingency Tables 


A contingency table provides a way of portraying 
data that can facilitate calculating probabilities. The 
table helps in determining conditional probabilities 
quite easily. The table displays sample values in 
relation to two different variables that may be 
dependent or contingent on one another. Later on, 
we will use contingency tables again, but in another 
manner. 


Suppose a study of speeding violations and drivers 
who use cell phones produced the following 
fictional data: 


                        Speeding violation   No speeding violation   Total
                        in the last year     in the last year
Cell phone user         25                   280                     305
Not a cell phone user   45                   405                     450
Total                   70                   685                     755


The total number of people in the sample is 755. 
The row totals are 305 and 450. The column totals 
are 70 and 685. Notice that 305 + 450 = 755 and 
70 + 685 = 755. 

Calculate the following probabilities using the 
table. 


a. Find P(person is a cell phone user).


a. number of cell phone users / total number
in study = 305/755


b. Find P(person had no violation in the last
year).


b. number that had no violation / total number
in study = 685/755


c. Find P(person had no violation in the last
year AND was a cell phone user).


c. 280/755


d. Find P(person is a cell phone user OR person
had no violation in the last year).


d. (305/755) + (685/755) - (280/755) =
710/755


e. Find P(person is a cell phone user GIVEN
person had a violation in the last year).


e. 25/70 (The sample space is reduced to the
number of persons who had a violation.)


f. Find P(person had no violation last year
GIVEN person was not a cell phone user).


f. 405/450 (The sample space is reduced to the
number of persons who were not cell phone
users.)
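
The six answers above all come from the same four cell counts. The sketch below shows one way to compute them in Python with exact fractions; the dictionary layout and the helper name `prob` are assumptions made for illustration.

```python
from fractions import Fraction

# Cell counts from the speeding-violation table: (phone use, violation status).
counts = {
    ("phone", "violation"): 25,
    ("phone", "no_violation"): 280,
    ("no_phone", "violation"): 45,
    ("no_phone", "no_violation"): 405,
}
total = sum(counts.values())


def prob(row=None, col=None):
    """P(row AND col); an argument left as None is summed over (a marginal)."""
    n = sum(v for (r, c), v in counts.items()
            if (row is None or r == row) and (col is None or c == col))
    return Fraction(n, total)


p_phone = prob(row="phone")                    # part a
p_no_violation = prob(col="no_violation")      # part b
p_both = prob("phone", "no_violation")         # part c
p_either = p_phone + p_no_violation - p_both   # part d (addition rule)

# Conditional probabilities divide by the marginal of the given event,
# which is the same as shrinking the sample space to that row or column.
p_phone_given_violation = prob("phone", "violation") / prob(col="violation")           # part e
p_no_violation_given_no_phone = prob("no_phone", "no_violation") / prob(row="no_phone")  # part f

print(p_phone, p_no_violation, p_both, p_either)
print(p_phone_given_violation, p_no_violation_given_no_phone)
```

`Fraction` prints results in lowest terms, so 305/755 appears as 61/151; the values are equal.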


Try it 


The following table shows the number of athletes who
stretch before exercising and how many had
injuries within the past year.


                    Injury in   No injury in   Total
                    last year   last year
Stretches           55          295            350
Does not stretch    231         219            450
Total               286         514            800


1. What is P(athlete stretches before 
exercising)? 

2. What is P(athlete stretches before 
exercising|no injury in the last year)? 


1. P(athlete stretches before exercising) =
350/800 = 0.4375

2. P(athlete stretches before exercising|no
injury in the last year) = 295/514 =
0.5739


The following table shows a random sample of 100 hikers and
the areas of hiking they prefer.


Sex      The         Near Lakes    On Mountain   Total
         Coastline   and Streams   Peaks
Female   18          16            ___           45
Male     ___         ___           14            55
Total    ___         41            ___           ___


Hiking Area Preference


a. Complete the table. 


Sex      The         Near Lakes    On Mountain   Total
         Coastline   and Streams   Peaks
Female   18          16            11            45
Male     16          25            14            55
Total    34          41            25            100


Hiking Area Preference


b. Are the events "being female" and 
"preferring the coastline" independent events? 


Let F = being female and let C = preferring 
the coastline. 


1. Find P(F AND C). 
2. Find P(F)P(C) 


Are these two numbers the same? If they are, 
then F and C are independent. If they are not, 
then F and C are not independent. 


1. P(F AND C) = 18/100 = 0.18
2. P(F)P(C) = (45/100)(34/100) = (0.45)(0.34)
= 0.153


P(F AND C) ≠ P(F)P(C), so the events F and C
are not independent.
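
The product test used above can be written as a small function. This is a sketch under the textbook's convention of requiring exact equality between P(A AND B) and P(A)P(B); the function name and argument order are illustrative assumptions.

```python
from fractions import Fraction


def is_independent(n_a_and_b, n_a, n_b, n_total):
    """Return True when P(A AND B) equals P(A)P(B), using exact fractions."""
    p_joint = Fraction(n_a_and_b, n_total)
    p_product = Fraction(n_a, n_total) * Fraction(n_b, n_total)
    return p_joint == p_product


# Hiker table: 18 females prefer the coastline, 45 hikers are female,
# 34 hikers prefer the coastline, and there are 100 hikers in total.
female_and_coastline_independent = is_independent(18, 45, 34, 100)
print(female_and_coastline_independent)
```

Exact fractions avoid floating-point rounding when comparing the two quantities.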


c. Find the probability that a person is male 
given that the person prefers hiking near lakes 
and streams. Let M = being male, and let L = 
prefers hiking near lakes and streams. 


1. What word tells you this is a conditional?

2. Fill in the blanks and calculate the
probability: P(___|___) = ___

3. Is the sample space for this problem all
100 hikers? If not, what is it?


1. The word 'given' tells you that this is a
conditional.

2. P(M|L) = 25/41

3. No, the sample space for this problem is
the 41 hikers who prefer lakes and
streams.


d. Find the probability that a person is female
or prefers hiking on mountain peaks. Let F =
being female, and let P = prefers mountain
peaks.


1. Find P(F).
2. Find P(P).
3. Find P(F AND P).
4. Find P(F OR P).


1. P(F) = 45/100
2. P(P) = 25/100
3. P(F AND P) = 11/100
4. P(F OR P) = 45/100 + 25/100 - 11/100
= 59/100


Try It 


The following table shows a random sample of 200 cyclists
and the routes they prefer. Let M = males and
H = hilly path.


Gender   Lake Path   Hilly Path   Wooded Path   Total
Female   45          38           27            110
Male     26          52           12            90
Total    71          90           39            200


1. Out of the males, what is the probability 
that the cyclist prefers a hilly path? 

2. Are the events “being male” and 
“preferring the hilly path” independent 
events? 


1. P(H|M) = 52/90 = 0.5778
2. For M and H to be independent, show
P(H|M) = P(H).

P(H|M) = 0.5778, P(H) = 90/200 = 0.45

P(H|M) does not equal P(H), so M and H
are NOT independent.


Muddy Mouse lives in a cage with three doors. If
Muddy goes out the first door, the probability that
he gets caught by Alissa the cat is 1/5 and the
probability he is not caught is 4/5. If he goes out
the second door, the probability he gets caught by
Alissa is 1/4 and the probability he is not caught is
3/4. The probability that Alissa catches Muddy
coming out of the third door is 1/2 and the
probability she does not catch Muddy is 1/2. It is
equally likely that Muddy will choose any of the
three doors, so the probability of choosing each
door is 1/3.


Caught or Not   Door One   Door Two   Door Three   Total
Caught          1/15       1/12       1/6          ____
Not Caught      4/15       3/12       1/6          ____
Total           ____       ____       ____         1


Door Choice


* The first entry 1/15 = (1/5)(1/3) is P(Door
One AND Caught).

* The entry 4/15 = (4/5)(1/3) is P(Door One
AND Not Caught).


Verify the remaining entries. 


a. Complete the probability contingency table. 
Calculate the entries for the totals. Verify that 
the lower-right corner entry is 1. 


Caught or Not   Door One   Door Two   Door Three   Total
Caught          1/15       1/12       1/6          19/60
Not Caught      4/15       3/12       1/6          41/60
Total           5/15       4/12       2/6          1


Door Choice


b. What is the probability that Alissa does not 
catch Muddy? 


c. What is the probability that Muddy chooses 


Door One OR Door Two given that Muddy is 
caught by Alissa? 
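
Each entry in the probability contingency table is a product of the form P(door) × P(outcome | door), and parts b and c follow from the row totals. The Python sketch below builds the table that way; the dictionary names are illustrative assumptions, and the results are left for the reader to compare with their own work.

```python
from fractions import Fraction

p_door = Fraction(1, 3)            # each door is equally likely
p_caught_given_door = {            # from the problem statement
    "one": Fraction(1, 5),
    "two": Fraction(1, 4),
    "three": Fraction(1, 2),
}

# Joint probabilities: P(door AND caught) = P(door) * P(caught | door).
caught = {d: p_door * p for d, p in p_caught_given_door.items()}
not_caught = {d: p_door * (1 - p) for d, p in p_caught_given_door.items()}

p_caught = sum(caught.values())          # "Caught" row total
p_not_caught = sum(not_caught.values())  # part b

# Part c: restrict the sample space to the "Caught" row.
p_one_or_two_given_caught = (caught["one"] + caught["two"]) / p_caught

print("P(not caught) =", p_not_caught)
print("P(Door One OR Door Two | caught) =", p_one_or_two_given_caught)
```

The two row totals sum to 1, which is the check on the lower-right corner requested in part a.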


The following table contains the number of crimes per
100,000 inhabitants from 2008 to 2011 in the U.S.


Year    Robbery   Burglary   Rape   Vehicle   Total
2008    145.7     732.1      29.7   314.7
2009    133.1     717.7      29.1   259.2
2010    119.3     701        27.7   239.1
2011    113.7     702.2      26.8   229.6
Total


United States Crime Index Rates Per 100,000
Inhabitants 2008-2011


TOTAL each column and each row. Total data
= 4,520.7


1. Find P(2009 AND Robbery).
2. Find P(2010 AND Burglary).
3. Find P(2010 OR Burglary).
4. Find P(2011|Rape).
5. Find P(Vehicle|2008).


1. 0.0294, 2. 0.1551, 3. 0.7165, 4. 0.2365,
5. 0.2575
Try It 


The following table relates the weights and heights of a
group of individuals participating in an
observational study.


Weight/Height   Tall   Medium   Short   Totals
Obese           18     28       14
Normal          20     51       28
Underweight     12     25       9
Totals


1. Find the total for each row and column.

2. Find the probability that a randomly
chosen individual from this group is Tall.

3. Find the probability that a randomly
chosen individual from this group is
Obese and Tall.

4. Find the probability that a randomly
chosen individual from this group is Tall
given that the individual is Obese.

5. Find the probability that a randomly
chosen individual from this group is
Obese given that the individual is Tall.

6. Find the probability a randomly chosen
individual from this group is Tall and
Underweight.

7. Are the events Obese and Tall
independent?


Weight/Height   Tall   Medium   Short   Totals
Obese           18     28       14      60
Normal          20     51       28      99
Underweight     12     25       9       46
Totals          50     104      51      205


1. Row totals: 60, 99, 46. Column totals: 50,
104, 51.

2. P(Tall) = 50/205 = 0.244

3. P(Obese AND Tall) = 18/205 = 0.088

4. P(Tall|Obese) = 18/60 = 0.3

5. P(Obese|Tall) = 18/50 = 0.36

6. P(Tall AND Underweight) = 12/205 =
0.0585

7. No. P(Tall) does not equal P(Tall|Obese).


Tree Diagrams 


Sometimes, when the probability problems are 
complex, it can be helpful to graph the situation. 
Tree diagrams can be used to visualize and solve 
conditional probabilities. 


Tree Diagrams 


A tree diagram is a special type of graph used to 
determine the outcomes of an experiment. It 
consists of "branches" that are labeled with either 
frequencies or probabilities. Tree diagrams can 
make some probability problems easier to visualize 
and solve. The following example illustrates how to 
use a tree diagram. 


In an urn, there are 11 balls. Three balls are red (R) 
and eight balls are blue (B). Draw two balls, one at 
a time, with replacement. "With replacement" 
means that you put the first ball back in the urn 
before you select the second ball. The tree diagram 
using frequencies that show all the possible 
outcomes follows. 

Total = 64 + 24 + 24 + 9 = 121


The first set of branches represents the first draw.
The second set of branches represents the second
draw. Each of the outcomes is distinct. In fact, we
can list each red ball as R1, R2, and R3 and each
blue ball as B1, B2, B3, B4, B5, B6, B7, and B8.
Then the nine RR outcomes can be written as:
R1R1 R1R2 R1R3 R2R1 R2R2 R2R3 R3R1 R3R2
R3R3
The other outcomes are similar.
There are a total of 11 balls in the urn. Draw two
balls, one at a time, with replacement. There are
11(11) = 121 outcomes, the size of the sample
space.
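
The 121 outcomes described above are few enough to list by computer. The following Python sketch enumerates every ordered pair of draws with replacement and tallies them by color pattern; the labels R1 to R3 and B1 to B8 follow the text, while the variable names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Label the balls as in the text: R1..R3 are red, B1..B8 are blue.
balls = [f"R{i}" for i in range(1, 4)] + [f"B{i}" for i in range(1, 9)]

# With replacement, every ordered pair (first draw, second draw) can occur.
sample_space = list(product(balls, repeat=2))

# Tally outcomes by color pattern, e.g. ("R1", "B4") counts toward "RB".
counts = {}
for first, second in sample_space:
    pattern = first[0] + second[0]
    counts[pattern] = counts.get(pattern, 0) + 1

p_rr = Fraction(counts["RR"], len(sample_space))

print(len(sample_space), counts)
print("P(RR) =", p_rr)
```

The four tallies correspond to the frequencies at the ends of the tree's branches.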


a. List the 24 BR outcomes: B1R1, B1R2, B1R3, ...


a. B1R1 B1R2 B1R3 B2R1 B2R2 B2R3 B3R1
B3R2 B3R3 B4R1 B4R2 B4R3 B5R1 B5R2
B5R3 B6R1 B6R2 B6R3 B7R1 B7R2 B7R3
B8R1 B8R2 B8R3


b. Using the tree diagram, calculate P(RR).


b. P(RR) = (3/11)(3/11) = 9/121


c. Using the tree diagram, calculate P(RB OR
BR).


c. P(RB OR BR) = (3/11)(8/11) + (8/11)(3/11)
= 48/121


d. Using the tree diagram, calculate P(R on 1st
draw AND B on 2nd draw).


d. P(R on 1st draw AND B on 2nd draw) =
P(RB) = (3/11)(8/11) = 24/121


e. Using the tree diagram, calculate P(R on
2nd draw GIVEN B on 1st draw).


e. P(R on 2nd draw GIVEN B on 1st draw) =
P(R on 2nd|B on 1st) = 24/88 = 3/11


This problem is a conditional one. The sample
space has been reduced to those outcomes that
already have a blue on the first draw. There
are 24 + 64 = 88 possible outcomes (24 BR
and 64 BB). Twenty-four of the 88 possible
outcomes are BR. 24/88 = 3/11.


f. Using the tree diagram, calculate P(BB).


f. P(BB) = 64/121


g. Using the tree diagram, calculate P(B on the
2nd draw given R on the first draw).


g. P(B on 2nd draw|R on 1st draw) = 8/11


There are 9 + 24 outcomes that have R on the
first draw (9 RR and 24 RB). The sample space
is then 9 + 24 = 33. 24 of the 33 outcomes
have B on the second draw. The probability is
then 24/33 = 8/11.


Try It 


In a standard deck, there are 52 cards. 12 
cards are face cards (event F) and 40 cards are 
not face cards (event N). Draw two cards, one 
at a time, with replacement. All possible 
outcomes are shown in the tree diagram as 
frequencies. Using the tree diagram, calculate 


P(FF). 


1st Draw   2nd Draw   Outcome   Frequency
12 F       12 F       FF        144
12 F       40 N       FN        480
40 N       12 F       NF        480
40 N       40 N       NN        1,600


Total number of outcomes is 144 + 480 +
480 + 1,600 = 2,704.


P(FF) = 144/(144 + 480 + 480 + 1,600) =
144/2,704 = 9/169


An urn has three red marbles and eight blue
marbles in it. Draw two marbles, one at a time, this
time without replacement, from the urn. "Without
replacement" means that you do not put the first
marble back before you select the second marble.
Following is a tree diagram for this situation. The
branches are labeled with probabilities instead of
frequencies. The numbers at the ends of the
branches are calculated by multiplying the
numbers on the two corresponding branches, for
example, (3/11)(2/10) = 6/110.

Total = (56 + 24 + 24 + 6)/110 = 110/110 = 1


1st Draw    2nd Draw    Outcome   Probability
B (8/11)    B (7/10)    BB        56/110
B (8/11)    R (3/10)    BR        24/110
R (3/11)    B (8/10)    RB        24/110
R (3/11)    R (2/10)    RR        6/110


NOTE
If you draw a red on the first draw from the three
red possibilities, there are two red marbles left to
draw on the second draw. You do not put back or
replace the first marble after you have drawn it.
You draw without replacement, so that on the
second draw there are ten marbles left in the urn.

Calculate the following probabilities using the tree 
diagram. 


a. P(RR) =


a. P(RR) = (3/11)(2/10) = 6/110


b. Fill in the blanks:
P(RB OR BR) = (3/11)(8/10) + (___)(___) =
48/110


b. P(RB OR BR) = (3/11)(8/10) + (8/11)(3/10)
= 48/110


c. P(R on 2nd|B on 1st) =


c. P(R on 2nd|B on 1st) = 3/10


d. Fill in the blanks.
P(R on 1st AND B on 2nd) = P(RB) = (___)(___)
= 24/110


d. P(R on 1st AND B on 2nd) = P(RB) =
(3/11)(8/10) = 24/110


e. Find P(BB).


e. P(BB) = (8/11)(7/10)


f. Find P(B on 2nd|R on 1st).


f. Using the tree diagram, P(B on 2nd|R on 1st)
= P(B|R) = 8/10.
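
Drawing without replacement changes the second-draw probabilities because one marble is missing from the urn. The Python sketch below builds the four end-of-branch probabilities for a two-draw tree; the function name `two_draw_tree` is a hypothetical helper chosen for illustration.

```python
from fractions import Fraction


def two_draw_tree(n_red, n_blue):
    """Return {outcome: probability} for two draws without replacement."""
    total = n_red + n_blue
    start = {"R": n_red, "B": n_blue}
    tree = {}
    for first in "RB":
        p_first = Fraction(start[first], total)
        remaining = dict(start)
        remaining[first] -= 1  # the first marble is not put back
        for second in "RB":
            # Second-draw probability is conditional on the first draw.
            p_second = Fraction(remaining[second], total - 1)
            tree[first + second] = p_first * p_second
    return tree


tree = two_draw_tree(n_red=3, n_blue=8)
for outcome, p in tree.items():
    print(outcome, p)
print("P(RB OR BR) =", tree["RB"] + tree["BR"])
```

`Fraction` reduces results to lowest terms, so 6/110 prints as 3/55; the four probabilities still sum to 1.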


If we are using probabilities, we can label the tree 
in the following general way. 


1st Draw   2nd Draw   Outcome
P(B)       P(B|B)     P(B AND B) = P(BB)
P(B)       P(R|B)     P(B AND R) = P(BR)
P(R)       P(B|R)     P(R AND B) = P(RB)
P(R)       P(R|R)     P(R AND R) = P(RR)


* P(R|R) here means P(R on 2nd|R on 1st)
* P(B|R) here means P(B on 2nd|R on 1st)
* P(R|B) here means P(R on 2nd|B on 1st)
* P(B|B) here means P(B on 2nd|B on 1st)


Try It 


In a standard deck, there are 52 cards. Twelve 
cards are face cards (F) and 40 cards are not 
face cards (N). Draw two cards, one at a time, 
without replacement. The tree diagram is 
labeled with all possible probabilities. 


1st Draw     2nd Draw     Outcome   Probability
F (12/52)    F (11/51)    FF        132/2,652
F (12/52)    N (40/51)    FN        480/2,652
N (40/52)    F (12/51)    NF        480/2,652
N (40/52)    N (39/51)    NN        1,560/2,652


1. Find P(FN OR NF).

2. Find P(N|F).

3. Find P(at most one face card).
Hint: "At most one face card" means zero
or one face card.

4. Find P(at least one face card).
Hint: "At least one face card" means one
or two face cards.


1. P(FN OR NF) = 480/2,652 + 480/2,652
= 960/2,652 = 80/221

2. P(N|F) = 40/51

3. P(at most one face card) =
(480 + 480 + 1,560)/2,652 = 2,520/2,652

4. P(at least one face card) =
(132 + 480 + 480)/2,652 = 1,092/2,652


A litter of kittens available for adoption at the
Humane Society has four tabby kittens and five
black kittens. A family comes in and randomly
selects two kittens (without replacement) for
adoption.


1st Kitten   2nd Kitten   Outcome
T (4/9)      T (3/8)      TT
T (4/9)      B (5/8)      TB
B (5/9)      T (4/8)      BT
B (5/9)      B (4/8)      BB


1. What is the probability that both kittens
are tabby?

a. (1/2)(1/2)  b. (4/9)(4/9)  c. (4/9)(3/8)
d. (4/9)(5/9)

2. What is the probability that one kitten of
each coloring is selected?

a. (4/9)(5/9)  b. (4/9)(5/8)
c. (4/9)(5/9) + (5/9)(4/9)
d. (4/9)(5/8) + (5/9)(4/8)

3. What is the probability that a tabby is
chosen as the second kitten when a black
kitten was chosen as the first?

4. What is the probability of choosing two
kittens of the same color?


1. c, 2. d, 3. 4/8, 4. 32/72


Try It 


Suppose there are four red balls and three
yellow balls in a box. Two balls are drawn
from the box without replacement. What is the
probability that one ball of each coloring is
selected?


(4/7)(3/6) + (3/7)(4/6)


Chapter Review


There are several tools you can use to help organize
and sort data when calculating probabilities.
Contingency tables help display data and are
particularly useful when calculating probabilities
that have multiple dependent variables.


A tree diagram uses branches to show the different
outcomes of experiments and makes complex
probability questions easy to visualize.


Glossary 


Tree Diagram 
the useful visual representation of a sample 
space and events in the form of a “tree” with 
branches marked by possible outcomes 
together with associated probabilities 
(frequencies, relative frequencies) 


Contingency Table 
the method of displaying a frequency 
distribution as a table with rows and columns 
to show how two variables may be dependent 
(contingent) upon each other; the table 
provides an easy way to calculate conditional 
probabilities. 


Introduction to Discrete Probability Distributions 


Introduction 


Chapter Objectives 


By the end of this chapter, the student should be 
able to: 


* Recognize the binomial probability distribution 
and apply it appropriately. 

* Be able to evaluate evidence using the binomial 
distribution. 


Suppose you flip a coin ten times and each time it 
comes up heads. This might make you start to 
wonder if there is something wrong with the coin. 
Perhaps it is a trick coin and is heads on both sides? 
Perhaps it is imbalanced and it is more likely to 
come up heads over tails. You may also wonder 
what is the probability of getting 10 heads in a row, 
if the coin was fair. 


Coin flipping is interesting because it is a random 
event. We cannot predict whether the next flip will 
be heads or tails (assuming it isn’t a trick coin). That 
means the outcome would be a random variable. A 
random variable is any variable where the 


outcome is determined by a random event. The 
outcome is also discrete because we count it. Above 
you flipped a coin ten times and counted the 
number of heads. A discrete random variable is a 
variable whose outcome is determined by a random 
event and where we count the outcomes. Other 
examples of discrete random variables include how 
many times you roll an even number with a die out 
of ten rolls; how many customers enter a store 
during a five-minute interval; how many times you 
draw a high card out of a deck of cards out of eight 
draws (without replacement). 


In each of these situations (coin toss, rolling die, 
number of customers, drawing cards), you could 
look at each situation and, each time, come up with 
a new formula to find the probability of these events 
happening. But this would take a lot of work and be 
inefficient. Instead, you would want to see if the 
situation can be modelled by a distribution. A 
probability distribution provides the theoretical 
probabilities of all of the possible events in a 
situation. For example, the following is a probability
model of how many heads you can get when you
flip a fair coin three times:


Number of heads   Probability
0                 0.125
1                 0.375
2                 0.375
3                 0.125


Notice that the probabilities range from 0 to 1 and
that the sum of the probabilities is 1.


The above table could be determined by working
out all of the possible outcomes (TTT, TTH, THT,
HTT, etc.), then counting how many heads were in
each outcome. But again, that is time consuming.
Instead, you want to see if there is a probability
distribution that models your situation that you can
use. For example, coin tossing can be modelled by
the binomial distribution.
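
The by-hand approach described above (list every outcome, then count the heads in each) can be sketched in a few lines of Python. The variable names are illustrative assumptions; the calculation assumes a fair coin, so all eight outcomes are equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All ordered outcomes of three flips: ('H','H','H'), ('H','H','T'), ...
outcomes = list(product("HT", repeat=3))

# Count how many outcomes contain 0, 1, 2, or 3 heads.
heads_counts = Counter(outcome.count("H") for outcome in outcomes)

# Each outcome has probability 1/8, so divide each tally by the total.
distribution = {
    heads: Fraction(n, len(outcomes))
    for heads, n in sorted(heads_counts.items())
}

for heads, p in distribution.items():
    print(heads, p, float(p))
```

Enumeration works for three flips, but the number of outcomes doubles with every extra flip, which is why a ready-made distribution is preferable.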


The binomial distribution is an example of a model 
for discrete random variables. There are many other 
models for discrete random variables including 
Poisson, geometric, hypergeometric, and discrete 
uniform to name a few. Each distribution comes 
with a set of criteria and if a situation fits that 
criteria, then the distribution can model it. That is, 
the distribution can produce theoretical 
probabilities for that situation. 


Theoretical vs. experimental probabilities
In simplest terms, a theoretical probability is
determined by using a formula while an
experimental probability is found by actually
carrying out the experiment. For example, if you
flip a coin 3 times and get 2 heads, then the
experimental probability
is 2/3 = 0.6667. The theoretical probability is
0.375. The theoretical probability is also called the
long-run probability, because the more times you
repeat the experiment, the closer the
experimental results will get to the theoretical
probability. This is an example of the law of large
numbers.
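
A short simulation can illustrate the long-run idea. The Python sketch below repeats the experiment "flip a fair coin three times" and records how often exactly two heads appear; as the number of repetitions grows, the experimental proportion should settle near the theoretical 0.375. The function name, the seed, and the repetition counts are assumptions chosen for illustration.

```python
import random


def estimate_two_heads(n_experiments, seed=0):
    """Estimate P(exactly 2 heads in 3 fair flips) by repeating the experiment."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_experiments):
        heads = sum(rng.random() < 0.5 for _ in range(3))
        if heads == 2:
            hits += 1
    return hits / n_experiments


for n in (10, 1_000, 100_000):
    print(n, estimate_two_heads(n))
```

Small runs can land far from 0.375; only the large runs are expected to be close.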


In this chapter, we are going to learn about the 
binomial distribution, which is a model for discrete 
random variables. In the next chapter, we will learn 
about the normal distribution, which is a model for 
continuous random variables. 


In particular, we want to use the binomial 
distribution to evaluate evidence. For example, 
going back to the example flipping a coin ten times 
and getting ten heads, we want to use the evidence 
(getting ten heads) to determine whether we think 
there is something wrong with the coin. 


Binomial distribution and introduction to evaluating 
evidence 

An introduction to the binomial distribution and
informal inferential statistics.


Binomial distribution 


Flipping a coin a certain number of times, let’s say 
ten times, is a classic example of a binomial 
distribution. What are the characteristics of flipping 
a coin that makes it binomial? 


Before we answer that question, let’s get a bit of
terminology out of the way. In probability theory,
an experiment is the actual process that you are
investigating. In the flipping coin example, the
experiment is flipping a coin ten times. A trial is a
specific instance of an experiment. Flipping a coin
only once is considered a trial.


Going back to the coin flipping example, let’s 
assume that we are dealing with a fair coin (i.e. 
probability of getting a head or a tail is 50%). We’ve 
already discussed that when we count the number of 
heads that this is an example of a discrete random 
variable. Notice that there are only two possible 
outcomes (heads or tails). This is a key criterion for 
a binomial distribution (binomial derives from Latin 
for two terms). Also, notice that the events are 


independent of each other. That is, if you get a head 
on one flip, that has no impact on the probability of 
getting a head on the next flip. This also means that 
the probability of getting a head remains constant. 
This is another key criterion for a binomial 
distribution. The other thing to notice is that the 
number of times we flip the coin is fixed. We don’t 
flip it until we get bored or run out of time. Instead 
we flip it ten times. This means that the number of 
trials is fixed. This is the last key criterion of a 
binomial distribution. 


There are five characteristics of a binomial 
experiment. 


1. The variable being studied is random. 

2. The outcomes of the variable are being 
counted. 

3. There are a fixed number of trials. The letter n 
denotes the number of trials. 

4. There are only two possible outcomes, called
"success" and "failure," for each trial. π denotes
the probability of a success on one trial, and
1 - π denotes the probability of a failure on one
trial.

5. The n trials are independent and are repeated
using identical conditions. Because the n trials
are independent, the outcome of one trial does
not help in predicting the outcome of another
trial. Another way of saying this is that for each
individual trial, the probability, π, of a success
and the probability, 1 - π, of a failure remain
constant.


Other examples of binomial distributions:


* Counting the number of 2’s that are rolled
when you roll a die six times (trial = rolling a
die, success = rolling a 2, n = 6, π = 1/6 =
0.1667).

* Counting the number of times a jack is pulled
out of a deck of cards (with replacement) when
you pull a card fifteen times (trial = pulling a
card, success = pulling a jack, n = 15, π =
4/52 = 0.0769).

* Counting the number of times that you win a
prize in the Tim Hortons Roll up the Rim to Win
contest out of four cups (trial = checking a cup
for a win, success = winning a prize, n = 4, π =
1/6 = 0.1667, assuming no special rules such as
anniversary rules that changed the odds of
winning).


Examples of situations that are not binomial 
include: 


* Counting the number of times a jack is pulled
out of a deck of cards (without replacement)
when you pull a card fifteen times. The fifth
criterion is not met because the events are now
dependent.

* Counting the number of times each number (1 to
6) is rolled when you roll a die fifty times. The
fourth criterion is not met because there are six
possible outcomes instead of two.

* Counting the number of times that you win a
prize in the Tim Hortons Roll up the Rim to Win
contest out of however many cups you buy during
the contest. Unless you know exactly how
many you'll buy during the contest, this would
not meet the third criterion of having a fixed
number of trials.


The Roll up the Rim example might not be 
binomial as it may fail the fifth criterion. At the 
beginning of the contest, the odds of winning are 
determined by counting how many prizes there are 
out of the total number of cups printed. As the 
contest goes on, the probability of winning may 


change depending on how many people have 
already won. At the beginning of the contest, this is 
also true but there are so many cups that it doesn’t 
really matter (think back to the sampling with 
replacement vs. without replacement in Chapter 1). 
Thus, this contest is only binomial at the beginning 
of the contest. 


Notation 


Suppose we are working on a probability question
and there are multiple probabilities that need to be
found. Then it gets time consuming to write out, for
example, “the probability that three rolls of a die
will result in at least one 2” or some variation over
and over again. Instead we will use notation to
reduce the work. We can write the previous
statement more quickly as P(X ≥ 1). The P( ) means
the “probability of”. X is the random variable being
studied (in this case the number of times a 2 has been
rolled out of 3 rolls). “X ≥ 1” means that a 2 is
rolled at least once.


It is important to define X. Otherwise, P(2 < X < 5)
could refer to any random variable and the person
reading the notation won’t know what it means.


Mean and standard deviation of the binomial 
distribution 


Just like a set of data, a binomial distribution has a
mean and a standard deviation. For the binomial
distribution, these are given by the formulas:

μ = nπ

σ = √(nπ(1 - π))


Going back to the Tim Hortons example, we had
n = 4 and π = 0.1667. Thus μ = 4 × 0.1667 =
0.6667 and σ = √(4 × 0.1667 × (1 - 0.1667)) = 0.745.
This means that if we buy four random cups of Tim
Hortons coffee during the Roll Up the Rim contest,
we will typically win 0.67 times, give or take 0.75.
Thus, when buying four cups of coffee, we will
typically win between -0.08 and 1.42 times. Since
we can’t win negative times, we will round the
lower bound to 0. Therefore, when buying four cups
of coffee, we will typically win between 0 and 1.42
times.
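
The mean, standard deviation, and typical range can be checked with a few lines of Python. This is a minimal sketch; the function name `binomial_mean_sd` is an illustrative assumption, and the typical range is taken to be the mean plus or minus one standard deviation, as in the text.

```python
import math


def binomial_mean_sd(n, pi):
    """Mean and standard deviation of a binomial distribution."""
    mean = n * pi
    sd = math.sqrt(n * pi * (1 - pi))  # note the square root
    return mean, sd


# Tim Hortons example: n = 4 cups, probability of a win pi = 1/6.
mean, sd = binomial_mean_sd(4, 1 / 6)

# Typical range: mean give or take one standard deviation,
# with the lower bound floored at 0 because wins cannot be negative.
low = max(0.0, mean - sd)
high = mean + sd

print(f"mean = {mean:.4f}, sd = {sd:.3f}")
print(f"typical range: {low:.2f} to {high:.2f}")
```

The upper bound computed here differs slightly from 1.42 because the text rounds the mean and standard deviation to two decimals before adding them.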


A market research study shows that 30% of all 
passengers on Canadian Airlines are business 
travelers. A random sample of 20 passengers is 
taken. 


1. Explain why the above situation satisfies
the criteria of a binomial distribution. If
there are any reasons why this situation
may not meet all of the criteria, discuss
them. Define n, X and π.

2. For the random sample, determine the
probability that:


1. Exactly seven of the passengers are
business travelers.

2. From ten to fourteen (inclusive) are
business travelers.

3. At least eleven of these passengers are
business travelers.

4. Five of these passengers are NOT travelling
on business.


3. What is the typical range of business
passengers in a random sample of 20?


1. The situation is a binomial distribution
because:


* It represents a random variable, as the
sample is randomly selected.

* It is a discrete variable, as we are counting
the number of business travellers. X is the
number of business travellers in the
sample.

* There is a fixed number of trials (n = 20).

* There are only two options: the passenger
is a business traveller (success) or they are
not a business traveller (failure).

* In a random sample, whether one passenger
is a business traveller does not affect the
probability of another passenger being a
business traveller. Therefore, the
probability of success remains constant: π
= 30% = 0.3


2. Use a computer program to generate the
binomial probabilities P(X) for n = 20 and π = 0.3.


1. P(X = 7) = 0.16426

2. P(10 ≤ X ≤ 14) = 0.04792 (highlight the
values in the column P(X) for X from 10 to
14, then look at the Sum in the lower
right)

3. P(X ≥ 11) = 0.01714 (highlight the
values in the column P(X) for X from 11
and higher, then look at the Sum in the
lower right)

4. Change π to 0.7, then re-run the
computer program. Look at when X is 5.
P(X = 5) = 0.00004


3. The mean is the same as the expected
value, which is 6.0, and the standard
deviation is 2.049. This gives us a typical
range of 3.951 to 8.049 for the
number of business passengers in a random
sample of 20 passengers.
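
The probabilities in this example can be reproduced with Python's standard library. The sketch below uses the binomial formula P(X = k) = C(n, k) π^k (1 - π)^(n - k); the helper names `binom_pmf` and `binom_range` are illustrative assumptions and do not refer to the computer program mentioned in the answer.

```python
from math import comb


def binom_pmf(k, n, pi):
    """P(X = k) for X ~ Binomial(n, pi)."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


def binom_range(lo, hi, n, pi):
    """P(lo <= X <= hi), summing the individual probabilities."""
    return sum(binom_pmf(k, n, pi) for k in range(lo, hi + 1))


n, pi = 20, 0.3  # 20 passengers, 30% business travellers

p_exactly_7 = binom_pmf(7, n, pi)
p_10_to_14 = binom_range(10, 14, n, pi)
p_at_least_11 = binom_range(11, n, n, pi)

# "Five are NOT on business": success is now a non-business
# traveller, so the success probability becomes 1 - pi = 0.7.
p_5_not_business = binom_pmf(5, n, 1 - pi)

print(f"{p_exactly_7:.5f} {p_10_to_14:.5f} "
      f"{p_at_least_11:.5f} {p_5_not_business:.5f}")
```

Summing the individual probabilities mirrors the "highlight the values, then look at the Sum" instruction in the answer.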


Evaluating evidence using the binomial 
distribution 


A company looked at its hiring practices. In
particular, they found that their hiring practices
appear to favour men over women. Based on past
data, they have found that regardless of the number
of applications by women, seventy-five percent of
hires are men. Due to this issue, they decide to
implement a program. In this program, the name and
any identifying features that may indicate the
gender of an applicant are removed. For example, if
the application says, “She executed a marketing
campaign that increased revenue by 30%”, this
would be changed to “They executed a marketing
campaign that increased revenue by 30%.” The
names on the applications were changed to an
alphanumeric identification (like AB-101). The company
claims that the program has worked, but they want
to check the claim.


How will the company determine if the program has 
worked? One way to do this would be using 
statistics. 


Now suppose that after a recent round of hiring, the
proportion of men hired was 70%. Would this be
enough evidence that the program is working? 70%
is definitely lower than 75%, but we know that
there is variability in sampling. This means that
even when around 75% of hires are men, as was the
case before the program was implemented, there may
be some rounds of hiring where 70% of hires are men
and some where 80% are. It won’t be 75% each time.
Instead we expect it to be close to 75%. Therefore,
if the program has caused the hiring practices to
change, would a recent round of hiring in which 70%
of hires were men be enough
evidence of change? What about 60%? What is the
line between normal variability around 75% and
abnormal variability? Statistics helps us figure that
out, and that is how we evaluate evidence using
statistics.


Let’s say in a recent round of hires there were 30 
new hires and 20 of these hires were men. 


Skepticism 


Any time we are trying to evaluate evidence, we 
always start from a position of skepticism. That is, 
we don’t want to assume what we are trying to 
show (i.e the claim). If we do that, we may bias the 
investigation. To illustrate, if you assume that your 
significant other is cheating on you, then this will 
colour all of the evidence you find (why did they 
show up five minutes late from work? They must be 
cheating!). A well-known real-world example of this 
position is the assumption in court that a defendant 
is innocent until proven guilty. That is, criminal 
court cases start with the assumption of innocence. 


In general, the position of skepticism is that nothing 
has changed, the program didn’t work, the 
experiment didn’t work, the effect being studied 
isn’t happening, etc. 


In our example, we will assume that the program 
that the company implemented did not work. That 
is, we are assuming that the proportion of hires that 
are men is still 75%. Another way of writing this is 
π = 75% (i.e. the population proportion).


Evidence 


In a court case, evidence would be witness 
testimony, forensics evidence, expert testimony, etc. 


In statistics, evidence is sample data. The evidence 
has been collected to evaluate the claim. In this 
case, the evidence has been evaluated to determine 
if the program is working. 


In our example, the sample data is the 20 men hires 
out of 30. This gives us a sample proportion of 
20/30 = 66.67%. The symbol for sample proportion
is p̂ (said “p hat”; the symbol above the p is ^).


Evaluating evidence in statistics 


To evaluate the evidence, we want to determine the 
probability of observing the evidence (or even better 
evidence against the assumption) assuming the 
assumption is true. Once we determine this 
probability, we need to determine if the event is 
unlikely or not unlikely. That is, we want to 
determine if it is unlikely that we could have observed the
evidence, if the assumption is true. Or is it not 
unlikely that we observed the evidence, if the 
assumption is true. If it is unlikely to have observed 
the evidence, then most likely there is something 
wrong with the assumption and the claim is likely 
true. If it is not unlikely to have observed the 
evidence, then we can’t actually conclude that there 
is something wrong with the assumption and we 
cannot conclude that the claim is true. 


To go back to the court case example, if you are a 
juror, you have to evaluate how unlikely or not 
unlikely it is that the defendant would have had a 
heated argument with the victim, and was found 
covered in blood and holding the murder weapon at 
the scene, if the defendant was innocent. If you 
think that it is unlikely that all of the pieces of 
evidence could have happened if the defendant is 
innocent, then you would find the defendant guilty. 
That is, the evidence calls into question the 
assumption. If you think that it is not unlikely
that all of these pieces of evidence could have 
happened if the defendant is innocent, then you 
would find the defendant not guilty. Notice that we 
don’t conclude that the defendant is innocent. That 
is, we can’t say that they are innocent; we can only 
say that they are not guilty. 


When evaluating evidence, we are trying to
evaluate the claim (i.e. not the position of
skepticism). Therefore, the evidence has been
collected about the claim. No evidence has been
collected about the assumption. Therefore, our
conclusion can only be about the claim and not the
assumption.


Therefore, if the probability is small and therefore
unlikely, we can say that there is enough evidence
to suggest that the assumption is likely false (i.e.
guilty).


If the probability is not small and therefore not 
unlikely, we can say that there is not enough 
evidence to suggest that the assumption is false (i.e. 
not guilty). 


In statistics, if the probability of an event 
happening is less than 1%, we say that the event is 
unlikely to happen. If the probability is greater 
than 10%, we say the event is not unlikely to 
happen. If the probability is between 1% and 10%, 


then it is up to the researcher to determine
whether they believe that the event is unlikely or
not unlikely. Usually, the researcher decides on the 
threshold between unlikely and not unlikely before 
performing the experiment or study. 
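
The decision rule above can be sketched as a small helper. This is a minimal illustration, not part of the textbook: the function name and the optional `threshold` argument (standing in for a cut-off the researcher chose in advance) are assumptions of ours.

```python
def classify_probability(prob, threshold=None):
    """Label a probability as 'unlikely' or 'not unlikely'.

    threshold is a hypothetical, researcher-chosen cut-off between 1% and 10%.
    Without it, probabilities in that middle zone are left to judgment.
    """
    if not 0 <= prob <= 1:
        raise ValueError("a probability must lie between 0 and 1")
    if threshold is not None:
        return "unlikely" if prob < threshold else "not unlikely"
    if prob < 0.01:
        return "unlikely"
    if prob > 0.10:
        return "not unlikely"
    return "researcher decides"


if __name__ == "__main__":
    for p in (0.005, 0.05, 0.20):
        print(f"{p:.3f}: {classify_probability(p)}")
```

Passing a threshold, for example `classify_probability(0.03, threshold=0.05)`, mirrors the practice of fixing the cut-off before the study is run.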


In our example, to evaluate the evidence, we want
to work out the probability that this company
would have hired 20 men out of 30 (or even better
evidence against the assumption) if the proportion
of men hires is still 75%. That is, we want to find
P(X ≤ 20, given π = 75%). Notice that this is a
conditional probability and the condition is the
assumption.


What does “or even better evidence against the 
assumption” mean? It means that we don’t just find 
the probability of exactly 20 out of 30 men hires. 
We find the probability of at most 20 out of 30 men 
hires because if the company hired 19, 18, 17, ... 
men then that would be even better evidence that 
75% is no longer correct (as the sample proportion 
is getting more and more different from the assumed 
population proportion). 


Why do we look at “or even better evidence against 
the assumption”? Often the probability of exactly 
one event happening is quite small. For example, 
the probability of getting exactly 10 heads out of 20 
coin tosses is 17.62%, even though that is the most 
likely event to occur. Therefore, if we only looked at 
the probability of exactly one event happening (i.e. 
P(X = 20)) rather than P(X ≤ 20), we may come to
the false impression that an event is unlikely, when 
it could actually be explained by normal sampling 
variability. 
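
The coin-toss figure quoted above can be checked directly from the binomial formula. The sketch below uses only Python's standard library; the helper name `binomial_pmf` is ours, and a fair coin (probability 0.5 of heads) is assumed.

```python
from math import comb


def binomial_pmf(k, n, pi):
    """P(X = k): probability of exactly k successes in n independent trials."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


if __name__ == "__main__":
    # Exactly 10 heads in 20 tosses of a fair coin (the most likely count).
    print(f"P(X = 10) = {binomial_pmf(10, 20, 0.5):.4f}")

    # Hiring example: exactly 20 men versus at most 20 men out of 30.
    exactly = binomial_pmf(20, 30, 0.75)
    at_most = sum(binomial_pmf(k, 30, 0.75) for k in range(21))
    print(f"P(X = 20) = {exactly:.4f}, P(X <= 20) = {at_most:.4f}")
```

The single-value probability is always smaller than the "at most" probability that contains it, which is why looking at one exact count can make ordinary results look rarer than they are.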


Finding the probability 


To find the probability, we need to find an 
appropriate distribution that models the situation. 
In later chapters, we will look at other models. 
Right now, the model we are going to use is the 
binomial distribution. For us to use this model, we
have to ensure that the situation meets all of
the conditions of the binomial distribution.


1. The variable being studied is random: This is not 
necessarily the case here as the applicants are 
not random and the hiring process is not 
random. If we randomly selected 30 hires from 
a greater number of hires, then it would be. 

2. The outcomes of the variable are being counted: 
We are counting the number of men hired. 

3. There are a fixed number of trials: We are 
looking at 30 hires (n = 30) 

4. There are only two possible outcomes: Either the 
hire is a man or the hire is not a man. 

5. The n trials are independent and the probability of
success and probability of failure remain constant:
This is true because we are assuming that the
probability of hiring a man remains constant at
75%.


Though the first condition is not met, we can still 
use the binomial distribution to model the situation. 
That the model is not perfectly met would be a 
limitation of the study. That means that we would 
want to put a caveat at the end of our conclusion to 
state that this might reduce the accuracy of our 
results. 


If the conditions of randomness and independence
are not fully met, then we can still use the
binomial distribution. But we do have to be wary
of the results. The other three conditions do need
to be met to use the binomial distribution.


Now that we have the model, we can find the
probability. In a computer program, we will use the
binomial distribution with n = 30 and π (or
probability of occurrence) = 0.75. Then we will
find P(X ≤ 20).


From the computer program, we get P(X ≤ 20) =
0.19659 = 19.659%. Again, this probability is found
under the assumption that the program has not
worked (i.e. π = 75%).
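
The text does not say which computer program produced this value. As a hedged alternative, the same cumulative probability can be computed in plain Python by adding up the binomial probabilities of 0, 1, ..., 20 men; the function names below are ours.

```python
from math import comb


def binomial_pmf(k, n, pi):
    """P(X = k) for a binomial with n trials and success probability pi."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


def binomial_cdf(x, n, pi):
    """P(X <= x): add up the probabilities of 0, 1, ..., x successes."""
    return sum(binomial_pmf(k, n, pi) for k in range(x + 1))


if __name__ == "__main__":
    # Hiring example: n = 30 hires, assumed proportion of men pi = 0.75.
    print(f"P(X <= 20) = {binomial_cdf(20, 30, 0.75):.5f}")
```

The printed value should agree with the 0.19659 reported above, up to rounding.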


Evaluating the probability 


The probability that we would have observed at 
most 20 hires that were men out of 30, under the 
assumption that the program did not work, is 
19.659%. Therefore, it is not unlikely that we could 
have observed this evidence as the probability is 
greater than 10%. This means that having 20 out of 
30 hires being men falls within the normal sampling 
variability for this data. 


Based on the evidence collected, there is not
sufficient evidence to suggest that the program
worked. Notice we don't conclude that the program
is not working.


In statistics, we never use the words “prove” or
“true” when making a conclusion. All of our
conclusions are based on sample data that we
are using to make a conclusion about the
population. Therefore, there is always the chance
of error.


Example 


Olivier has spent five years honing his archery skills 
in various seedy locales around the world. Now he
has returned to his city of birth to use these skills to 
take out criminals. One night while drinking vodka 
with his friends, he boasts that he can shoot an 
arrow into the bullseye, blindfolded at a distance of 
50m 90% of the time. 


“I don’t believe you!” Jack, Olivier’s best friend,
slurred. 


“I swear! I’ve really honed my skills.” Olivier
countered. 


“But remember last week when we were in that
darkened factory, you missed two of your shots!”
Thelma, Olivier’s sister, countered.


“No. I meant to miss them.” 


Jack thought for a moment. “I think you are 
exaggerating and I’m going to test you.” 


“You’re on!” Olivier sneered arrogantly. 


To test whether Olivier was exaggerating about his
marksmanship, Jack set up a bunch of targets and
randomly had Olivier attempt the shot. Olivier hit
the bullseye (blindfolded at a distance of 50m) 39 
out of 50 times. 


1. If Olivier is not exaggerating, how many times
out of 50 do we typically expect him to hit the 
bullseye? Write your answer as a range that 
takes into account variation. 


Answer: We would expect Olivier to hit the bullseye 
45 times give or take 2.121 times. This means a 
typical range is 42.88 to 47.12 bullseyes out of 50. 
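
The arithmetic behind this range is the binomial mean nπ give or take one standard deviation √(nπ(1 − π)). A minimal sketch, with a function name of our own choosing:

```python
from math import sqrt


def typical_range(n, pi):
    """Return (mean, sd, low, high) for a binomial count.

    The typical range is taken as mean plus or minus one standard deviation.
    """
    mean = n * pi
    sd = sqrt(n * pi * (1 - pi))
    return mean, sd, mean - sd, mean + sd


if __name__ == "__main__":
    mean, sd, low, high = typical_range(50, 0.90)
    print(f"expect {mean:.0f} give or take {sd:.3f}: {low:.2f} to {high:.2f}")
```

For counts that cannot be negative, a lower bound below zero would simply be reported as 0.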


2. Based on your answer in question 1, is 39 out of 50
times potentially abnormal? Explain. 


Answer: Since 39 is outside of the range, it would be 
deemed atypical, but that does not necessarily mean 
that it is abnormal. 


3. What assumption do we need to make before
determining whether the 39 out of 50 provides
evidence for or against Olivier exaggerating?


Answer: Since Jack wants to show that Olivier is 
exaggerating, we want to assume that Olivier is not 
exaggerating. This means we want to assume π =
90%, where π is the proportion of bullseyes that
Olivier hits. 


4. What model (i.e. distribution) will you use to
test the evidence against the assumption?
Explain why it is the best model to use. Note:
This situation might not completely fit the
model, but explain why it is still a reasonable
model to use.


Answer: The situation satisfies the conditions of
the binomial distribution:

The variable being studied is random: Since Jack
is randomly having Olivier take the shot, we
can say this is a random event.

The outcomes of the variable are being counted:
We are counting the number of bullseyes.

There are a fixed number of trials: We are
looking at 50 shots with the bow and arrow.

There are only two possible outcomes: Either the
shot is a bullseye or it is not.

The n trials are independent and the probability of
success and probability of failure remain constant:
This is true because we are assuming that the
probability of hitting the bullseye remains
constant at 90%.


5. What probability do you need to find to
evaluate the evidence against the assumption? 


Answer: We need to find the probability that Olivier 
hits at most 39 out of 50 bullseyes, assuming his 
accuracy is 90%. NOTE: We look at “at most 39” 
because having fewer bullseyes is even better evidence
that Olivier is exaggerating (i.e. better evidence 
against the assumption). 


6. Find that probability.


Answer: P(X ≤ 39, given π = 90%)
= 0.00935 = 0.94% (from computer program with n
= 50, π (or probability of occurrence) = 90%).


7. In the context of the problem, interpret the
probability. 


Answer: The probability that Olivier hit at most 39 
out of 50 bullseyes, under the assumption that he 
wasn’t exaggerating about his accuracy is 0.94%. 


8. Does the probability provide evidence to
support whether Olivier is exaggerating or not? 
Explain. 


Answer: Since the probability of observing our
sample data (or even better evidence against the
assumption) is less than 1%, it is unlikely that we
would have observed it if Olivier were not
exaggerating (i.e. if his accuracy were 90%).
Therefore, it is likely that Olivier is
exaggerating and cannot hit the bullseye 90% of
the time blindfolded from 50m.
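
The probability used in this example can be reproduced with a lower-tail binomial sum. This is a sketch in plain Python (our choice of tool, since the text does not name its program), and the function names are ours.

```python
from math import comb


def binomial_pmf(k, n, pi):
    """P(X = k) for n independent trials with success probability pi."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


def prob_at_most(x, n, pi):
    """Lower tail P(X <= x), used when fewer successes count against the assumption."""
    return sum(binomial_pmf(k, n, pi) for k in range(x + 1))


if __name__ == "__main__":
    # Olivier: at most 39 bullseyes in 50 shots, assuming 90% accuracy.
    p = prob_at_most(39, 50, 0.90)
    print(f"P(X <= 39) = {p:.5f} ({p:.2%})")
    print("below the 1% line" if p < 0.01 else "not below the 1% line")
```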


As stated in a previous question, the chance of 
a CRA audit for a tax return with over
$25,000 in income is about 2% per year. An 
employee at I&S Square, a company that helps 
individuals do their yearly tax returns and helps 
if there is an audit, has noticed that people in 
Seba Beach, Alberta appear to have a greater 
chance of an audit than the rest of Canadians. 
Out of a random sample of 45 residents, four of 
them have been audited. 


1. If the residents of Seba Beach are being 
audited fairly, how many residents out of 
45 do we typically expect to get audited in 
a year? Write your answer as a range that 
takes into account variation. 

2. Based on your answer in question 1, is 4 out of 45
audits potentially abnormal? Explain. 

3. What assumption do we need to make 
before determining whether the 4 out of 
45 audits is unfair? 

4. What model (i.e. distribution) will you use
to test the assumption? Explain why it is
the best model to use. Note: This situation
might not completely fit the model, but
explain why it is still a reasonable model
to use.

5. What probability do you need to find to
evaluate the assumption?

6. Find that probability.

7. In the context of the problem, interpret the
probability.

8. Does the probability provide evidence to
support or refute whether residents of Seba
Beach are being unfairly audited? Explain.


1. We would expect 0.9 give or take 0.939 to
be audited. This means a typical range is 0
to 1.8 residents to be audited out of 45.

2. Since 4 is outside of the range, it would be
deemed atypical, but that does not
necessarily mean that it is abnormal.


3. Since the employee at I&S Square wants to
show that something strange is happening
in Seba Beach, they would want to assume
that nothing strange is happening. That is,
the rate of audits is the same in Seba Beach
as anywhere in Canada. This means we
want to assume π = 2%, where π is the
proportion of people who are audited.


4. The binomial distribution:

The variable being studied is random: We are
looking at a random sample.

The outcomes of the variable are being
counted: We are counting the number of
audits.

There are a fixed number of trials: We are
looking at 45 residents.

There are only two possible outcomes: Either
the resident is audited or they are not.

The n trials are independent and the
probability of success and probability of
failure remain constant: This is true because
we are assuming that the probability of
being audited remains constant at 2%.


5. We need to find the probability that at
least 4 out of 45 residents are audited,
assuming an audit rate of 2%.

6. P(X ≥ 4, given π = 2%) = 0.01242 =
1.24% (from computer program with n =
45, π (or probability of occurrence) =
2%).

7. The probability that at least 4 out of 45
Seba Beach residents are audited, under
the assumption that the audit rate is 2%, is
1.24%.


8. Since the probability that we observed our
sample data is between 1% and 10%, we
have to determine if the probability is
unlikely or not unlikely. Since it is closer
to 1% than 10%, we can say that the
sample data is unlikely to have occurred
under the assumption. Therefore, the
evidence suggests that there is something
wrong with the assumption. That is, there
is evidence that the residents of Seba
Beach are being audited at a higher rate
than the rest of Canada.
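
Here the evidence against the assumption lies in the upper tail ("at least 4"), so the probability is 1 minus the probability of 3 or fewer audits. A minimal sketch in plain Python, with function names of our own choosing:

```python
from math import comb


def binomial_pmf(k, n, pi):
    """P(X = k) for n independent trials with success probability pi."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


def prob_at_least(x, n, pi):
    """Upper tail P(X >= x), computed as 1 - P(X <= x - 1)."""
    return 1.0 - sum(binomial_pmf(k, n, pi) for k in range(x))


if __name__ == "__main__":
    # Seba Beach: at least 4 audits among 45 residents, assuming a 2% audit rate.
    p = prob_at_least(4, 45, 0.02)
    print(f"P(X >= 4) = {p:.5f} ({p:.2%})")
```

The direction of the tail follows the claim: "at most" when fewer successes would support it, "at least" when more successes would.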


Executives at Bull, a Canadian-owned cell phone
company, are not very happy with their current 
customer satisfaction surveys. Using a Likert 
scale, they surveyed a very large sample of 
clients who phoned Bull and spoke to a 
customer service representative. They have 
determined that only 60% of customers rate 
their overall satisfaction with the service they 
received at 4 or higher. That is, they either
strongly agreed or agreed with the statement, “I
am happy with the overall customer service I 
received during my most recent call to Bull.” 


They feel that this is too low as 40% of 
customers were not happy with their service. To 
address these issues, they’ve brought in a 
consultant who has suggested that customers 
are happier with their service if they feel 
they’ve built a rapport with the customer 
service representative. Thus, Bull has decided to 
train their customer service representatives to 
start each call with a short conversation. As
customers are from across Canada and it would
be bad if the conversations were generic, to
help their customer service representatives
build rapport, a short notice shows up on their
screens before they take the call that contains
suggested conversation topics for the area the
person is calling from. For example, it might
include information about weather in the local
area and how the local sporting team has done
in their most recent game.


After the customer service representatives have 
been trained in how to make small talk to build 
rapport, a random sample of sixty customers 
who called Bull and spoke to a customer service 
representative is taken. The participants are 
asked the same question about their overall 
satisfaction with their customer service phone 
call as stated above. The results of the survey 
are listed below: 
Does the recent sample provide sufficient
evidence to suggest that the proportion of
customers who are happy with their overall
service when they call Bull has increased from
60%? Explain your answer in detail.


To be skeptical, we want to assume that the 
program has not worked (i.e. π stayed at 60%).
The evidence is 41 out of 60 customers gave a 
rating of 4 or 5. Perfect evidence that the 
program worked would be 60 out of 60 happy 
customers in every single sample. The 
probability that we would observe at least 41 
out of 60 customers who gave a score of four or 
five regarding their overall satisfaction, 
assuming that the new program has not 
worked, is 11.70% (from a computer program 
with n = 60, π = 0.60). Therefore, it is not
unlikely that we observed the evidence that we 
did, under the assumption the program did not 
work. This means that we cannot conclude that 
the program worked. 


What we can conclude when the probability is “not
unlikely”: If the probability is greater than 10%,
then it means that it is not unlikely that we
observed this evidence under the assumption. We
can NOT conclude that the assumption is likely
true as the evidence was collected to evaluate the
claim (not the assumption). Instead, we can only
conclude that there is not enough evidence to say
that the claim is true. When the probability is “not
unlikely”, we have really learned very little about
the claim.


Chapter Review 


A statistical experiment can be classified as a 
binomial experiment if the following conditions are 
met: 


1. The variable being studied is random.

2. The outcomes of the variable are being
counted.

3. There are a fixed number of trials. The letter n
denotes the number of trials.

4. There are only two possible outcomes, called
"success" and "failure," for each trial. π denotes
the probability of a success on one trial, and
1 − π denotes the probability of a failure on one
trial.

5. The n trials are independent and are repeated
using identical conditions. Because the n trials
are independent, the outcome of one trial does
not help in predicting the outcome of another
trial. Another way of saying this is that for each
individual trial, the probability, π, of a success
and probability, 1 − π, of a failure remain
constant.


The outcomes of a binomial experiment fit a
binomial probability distribution. The random
variable X = the number of successes obtained in
the n independent trials. The mean of X can be
calculated using the formula μ = nπ, and the
standard deviation is given by the formula
σ = √(nπ(1 − π)).


To evaluate evidence, we must first begin from a 
position of skepticism (i.e. assume the opposite of 
what we want to show). Then we must find the
probability of observing the actual evidence, or
even better evidence against the assumption,
assuming the assumption is true.
We can then evaluate the probability by 
determining whether it is less than 1% (which 
means it is unlikely the evidence occurred under the 
assumption) or if it is greater than 10% (which 
means it is not unlikely the evidence occurred under 
the assumption). If the probability is deemed 
unlikely, then we reject the assumption, which 
means there is enough evidence to support what we 
originally wanted to show (the claim). If the 
probability is deemed not unlikely, then we do not 
reject the assumption, which means there is not 
enough evidence to support what we originally 
wanted to show (the claim). In the latter situation, 
we cannot make any conclusions about the 
assumption as the evidence was collected only for 
the claim. 
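
The whole procedure summarized above can be sketched as one function. This is an illustrative outline under the binomial model, not the textbook's own tool: the function names, the `direction` argument, and the returned labels are assumptions of ours.

```python
from math import comb


def binomial_pmf(k, n, pi):
    """P(X = k) for n independent trials with success probability pi."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


def evaluate_evidence(x, n, pi, direction):
    """Return (probability, verdict) for x observed successes in n trials.

    pi is the assumed proportion (the position of skepticism).
    direction is "at most" when fewer successes support the claim and
    "at least" when more successes support the claim.
    """
    if direction == "at most":
        counts = range(0, x + 1)
    elif direction == "at least":
        counts = range(x, n + 1)
    else:
        raise ValueError('direction must be "at most" or "at least"')
    prob = sum(binomial_pmf(k, n, pi) for k in counts)
    if prob < 0.01:
        verdict = "unlikely"
    elif prob > 0.10:
        verdict = "not unlikely"
    else:
        verdict = "researcher decides"
    return prob, verdict


if __name__ == "__main__":
    examples = [
        ("hiring program", 20, 30, 0.75, "at most"),
        ("archery claim", 39, 50, 0.90, "at most"),
        ("Seba Beach audits", 4, 45, 0.02, "at least"),
    ]
    for label, x, n, pi, direction in examples:
        prob, verdict = evaluate_evidence(x, n, pi, direction)
        print(f"{label}: P = {prob:.5f} -> {verdict}")
```

An "unlikely" verdict supports the claim, while a "not unlikely" verdict only means there is not enough evidence for the claim; it says nothing in favour of the assumption.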


Practice 


The first few exercises provided are from the
textbook Business Statistics -- BSTA 200 -- Humber
College -- Version 2016RevA -- DRAFT 2016-04-04
by Alexander Holmes, Lyryx Learning:
http://cnx.org/contents/f3aefa9e-58d2-41ea-969f-04dc2cb04c82@5.20


Use the following information to answer the next seven 
exercises: The Higher Education Research Institute at 
UCLA collected data from 203,967 incoming first- 
time, full-time freshmen from 270 four-year colleges 
and universities in the U.S. 71.3% of those students 
replied that, yes, they believe that same-sex couples 
should have the right to legal marital status. 
Suppose that you randomly pick eight first-time, 
full-time freshmen from the survey. You are 
interested in the number that believes that same-sex
couples should have the right to legal marital
status. 


In words, define the random variable X. 


X = the number that reply “yes” 


What values does the random variable X take 
on? 


0, 1, 2, 3, 4, 5, 6, 7, 8


Construct the probability distribution function 
(PDF). That is, fill in the table below. In the left 
column put in the possible values for X. In the 
right column, put in the probability for exactly 
X, i.e. P(X =x) 
On average (μ), how many would you expect to
answer yes?


5.7


What is the standard deviation (σ)?


1.2795


What is the probability that at most five of the 
freshmen reply “yes”? 


0.4151 


What is the probability that at least two of the 
freshmen reply “yes”? 


0.9990 
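
The answers to this set of exercises can be checked with a short script. The sketch below assumes the binomial model with n = 8 and π = 0.713 taken from the survey; the function names are ours, and plain Python stands in for whatever program the course uses.

```python
from math import comb, sqrt


def binomial_pmf(k, n, pi):
    """P(X = k) for n independent trials with success probability pi."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


def pdf_table(n, pi):
    """List of (x, P(X = x)) rows for x = 0, 1, ..., n."""
    return [(x, binomial_pmf(x, n, pi)) for x in range(n + 1)]


def summary(n, pi):
    """Mean, standard deviation, P(X <= 5) and P(X >= 2) for the freshmen sample."""
    table = dict(pdf_table(n, pi))
    mean = n * pi
    sd = sqrt(n * pi * (1 - pi))
    at_most_five = sum(table[x] for x in range(0, 6))
    at_least_two = 1.0 - table[0] - table[1]
    return mean, sd, at_most_five, at_least_two


if __name__ == "__main__":
    for x, p in pdf_table(8, 0.713):
        print(f"{x}  {p:.4f}")
    mean, sd, at_most_five, at_least_two = summary(8, 0.713)
    print(f"mean = {mean:.3f}, sd = {sd:.4f}")
    print(f"P(X <= 5) = {at_most_five:.4f}, P(X >= 2) = {at_least_two:.4f}")
```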


A school newspaper reporter decides to 
randomly survey 12 students to see if they will 
attend Tet (Vietnamese New Year) festivities 
this year. Based on past years, she knows that 


18% of students attend Tet festivities. We are 
interested in the number of students who will 
attend the festivities. 


1. In words, define the random variable X. 

2. List the values that X may take on. 

3. How many of the 12 students do we expect 
to attend the festivities? 

4. Find the probability that at most four 
students will attend. 

5. Find the probability that more than two 
students will attend. 


1. X = the number of students who will 
attend Tet. 

2. 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12

3. 2.16 

4. 0.9511 

5. 0.3702 


Use the following information to answer the next three 
multiple choice questions: The probability that the 
Calgary Flames will win any given game is 0.4617 
based on a 45-year win history of 1,616 wins out of 
3,500 games played (as of Sept. 2017). An 
upcoming monthly schedule contains 12 games. 


The expected number of wins for that upcoming 
month is: 


1. 1.67
2. 12
3. 1616/3500
4. 5.54


d. 5.54 


Let X = the number of games won in that upcoming 
month. 


What is the probability that the Calgary Flames 
win exactly six games in that upcoming month? 


1. 0.2178 
2. 0.4167 
3. 0.7664 
4. 0.7116 


What is the probability that the Calgary Flames
win at least five games in that upcoming month?


1. 0.2176 
2. 0.2762 
3. 0.7238
4. 0.5062 


The chance of a Canada Revenue Agency
audit for a tax return with over $25,000 in 
income is about 2% per year. We are interested 
in the expected number of audits a person with 
that income has in a 20-year period. Assume 
each year is independent. 


1. In words, define the random variable X. 

2. List the values that X may take on. 

3. How many audits are expected in a 20- 
year period? 

4. Find the probability that a person is not 
audited at all. 

5. Find the probability that a person is 
audited more than twice. 


1. X = the number of audits in a 20-year 
period 

2. 0, 1, 2, …, 20

3. 0.4 

4. 0.6676 

5. 0.0071 


According to The World Bank, only 9% of the 
population of Uganda had access to electricity 
as of 2009. Suppose we randomly sample 150 


people in Uganda. Let X = the number of 
people who have access to electricity. 


1. Calculate the mean and standard deviation 
of X. 

2. Find the probability that 15 people in the 
sample have access to electricity. 

3. Find the probability that at most ten 
people in the sample have access to 
electricity. 

4. Find the probability that more than 25 
people in the sample have access to 
electricity. 


1. Mean = np = 150(0.09) = 13.5; Standard
Deviation = √(npq) = √(150(0.09)(0.91)) =
3.5050
2. P(x = 15) = 0.0988
3. P(x ≤ 10) = 0.1987
4. P(x > 25) = 0.0009


Jenna and Megan looked at the new packaging. 
“I guess it looks ok.” Megan hedged.


“The design team says that this new packaging 
really sells the time-saving nature of the kit.” 


“But it’s kinda off-putting.” They continued to
stare at the new packaging. Jenna and Megan
had developed a make-up kit called ‘5 minute 
make-up’, which was aimed at women on the 
go who wanted to put ‘their face on’ but a lot 
quicker than they usually did. Their target 
market was new moms, moms with full-time 
jobs, full-time students working full-time, .... In 
other words, anyone who didn’t have 30 
minutes every morning to do their make-up. 
Their little start-up was doing well. They’d 
arranged for their product to be produced, had 
made how-to videos on YouTube, and were 
starting to get their products put into stores. 
Their dream placement was in Sephora. 


Now that they were more established, they had 
decided to hire a marketing expert who could 
help take them to the next level. The first thing 
that Leticia suggested was to change the 
packaging. She argued that their old packaging 
didn’t convey the premise of the product clearly 
enough. With the help of a design team, Leticia 
had come up with a packaging that showed a 
harried woman with hair everywhere and bags 
under her eyes looking overwhelmed. But when 
you flip the package over, the woman was now 
perfectly put together: ‘only five-minute make-up
can save you from being a hot mess’.


Jenna finally broke the silence. “I don't know if
I would even pick up this package. It just looks
so depressing. But what do we do?”


“Leticia wants to put the product in this new 
packaging in five stores that carry our products. 
Based off of previous sales numbers, we know 
that the stores sell 68% of the product we give 
them in a two-week period.” 


“How does that help us? Do we just watch our 
sales plummet?” Jenna was sounding 
exasperated. 


“I’m getting to that.” Megan soothed. “Leticia is
convinced this packaging will increase sales. 
But what if we can show her that it doesn’t? 
Let’s put this packaging into the five stores and 
then see how many kits were actually sold. I bet 
that we can show her that the sales went 
down.” 


“I don’t see how that is useful. We still have to
pay her stupid fee.” 


“You should read her contract more closely. She 
only gets paid if she can show that sales 
increased. If they don’t, then not only does she 
not get paid but she also has to pay for any 
contractors (i.e. the design team).” 


Jenna perked up visibly at this. 


Over the next two weeks, five stores carried the
new packaging. Megan and Jenna provided
each store with 100 kits. At the end of the two
weeks, 306 of the kits were sold.


1. What is the observation unit? What is the
variable? Categorize it.

2. What do Jenna and Megan want to show?

3. What assumption do Jenna and Megan
need to make in order to investigate your
answer in question 2? Write your answer
both as a sentence and as a probability.

4. What is the evidence that Jenna and
Megan have found?

5. Describe the process that Jenna and Megan
will go through to evaluate this evidence.
Your description should include (but is not
limited to) what probability they will find
and what they will do with that probability
once they’ve found it. Don’t actually do the
process (that comes later). Just describe
what they will do.

6. Jenna and Megan believe that the binomial
distribution will be the best model to find
the required probability. Does this
situation meet the criteria for a binomial?
Examine each criterion and comment on
whether it is satisfied here or not.

7. Regardless of your answer above, use a
binomial distribution to model this
situation. Find the appropriate probability
to evaluate the evidence using MegaStat.

8. In sentence form, explain what the
probability you have found means in the
context of the question. Do not make a
conclusion yet. Instead explain what it is a
probability of.

9. Now make a conclusion. In particular,
answer this question: Is there enough
evidence to suggest that Leticia’s new
packaging has reduced sales? Justify your
answer.


1. Observational unit: Five-minute make-up
kit; Variable: Did it sell or not; Categorize:
Categorical

2. They want to show that the new packaging
will decrease sales.

3. They need to assume the opposite of what
they want to show. Therefore, they need to
assume that the new packaging does not
decrease sales. Therefore, the proportion of
kits sold stays the same at 68%.

4. They have found that out of 500 kits
supplied, 306 of them have been sold.

5. They first need to start with an assumption
(π = 68%). Then they need to come up
with a model based on this assumption.
Once they have the model, they will use it
to find the probability that the stores sold
at most 306 out of 500 kits, assuming that
the new packaging has not decreased sales
(i.e. stayed at 68%). Once they have the
probability, they need to determine
whether the event is unlikely or not
unlikely. An event is unlikely if the
probability is less than 1%. An event is not
unlikely if the probability is more than
10%. If the event is unlikely, then it means
that it is unlikely we observed the evidence
under the assumption. Since we know the
evidence actually happened, that makes us
question the assumption. Thus, it is
unlikely the assumption is true based on
the evidence. If the event is not unlikely,
then there is not enough evidence to
question the assumption, and they cannot
conclude that the new packaging
decreased sales.


6.
* Is the data randomly collected? Most
likely not. The 500 kits that we are
looking at were not randomly
selected.

* Is the data discrete (countable)? As
we are counting the number of kits
that are sold the data is discrete.

* Are the events independent? This may
be a fair assumption for this study.
Most likely the sale of one kit is not
dependent on whether another kit is
sold. Though if two friends buy the
kits together or someone buys a bunch
as presents, this is not the case, but in
general it is more likely independent
than dependent.

* Are there a fixed number of trials? In
this case, the number of trials would
be the 500 kits with the new
packaging.

* Are there two possible outcomes?
Either a kit is sold or it is not.


12. P(X ≤ 306) = 0.00077 = 0.077% 

13. The probability that we observed at most 
306 out of 500 kits sold, assuming the 
rate of sales is 68%, is 0.077%. 

14. Since the probability is less than 1%, 
it is very unlikely that we would have 
observed this evidence under the 
assumption. Since we actually observed the 
evidence but assumed that the rate was 
68%, what we have assumed is called into 
question. Therefore, it is unlikely that the 
assumption is true, and it is likely 
that the new packaging has resulted in a 
decrease in sales. 
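
The probability in this solution can be reproduced with a short Python sketch. It assumes the binomial model described above (500 trials, success probability 0.68); the helper name binom_cdf is illustrative and does not come from the text. 

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Assumption being tested: the sales rate is unchanged at 68%.
prob = binom_cdf(306, 500, 0.68)
print(f"P(X <= 306) = {prob:.5f}")

# Rule used in the text: a probability below 1% counts as unlikely.
print("unlikely" if prob < 0.01 else "not unlikely")
```

Summing the individual binomial terms keeps the sketch free of third-party libraries; a statistics package would normally provide this cumulative probability directly. 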


Striking Donkey Coffee recently sold an 80% 
stake in their company to Baravalle, an Italian 
coffee conglomerate. Striking Donkey’s logo is 
simplistic. Baravalle wants to maintain brand 
recognition, but also wants to put their stamp 
on the company. In particular, Baravalle is 
known for its modern and stylish 
advertisements. 


Designers and marketers at Baravalle have 
worked tirelessly for the last month to come up 
with two revised Striking Donkey logos (not 
included, because it is top secret). They are 
referred to as Logo 1 and Logo 2. 


Now they want to determine whether customers 
show any preference for either logo. To do this, 
they asked a random sample of 40 customers 
who were familiar with the Striking Donkey 
brand which logo they prefer. Participants had 
to make a choice between the logos. 


The results of the study were that 26 out of the 
40 participants preferred Logo 2. 


The marketers at Baravalle now want to do a 
statistical analysis to determine whether Logo 2 
is preferred significantly more than Logo 1. 


1. What assumption do you need to start with 
when determining whether Logo 2 is 
preferred significantly more than Logo 1? 
State your answer both in a sentence and 
mathematically. 

2. Can this situation be modelled by the 
binomial distribution? Support your 
answer by showing why or why not this 
situation satisfies each of the five criteria 
of the binomial distribution. 


3. After previous issues with horrible new 
logo launches, Baravalle only wants to go 
forward if there is clear evidence that Logo 
2 is preferred. Based on this, what level of 
significance should they use? Explain your 
reasoning. 

4. Regardless of your answer to question 2, 
assume that this situation satisfies the 
binomial distribution for the remainder of 
the question. Use a computer program to find 
the appropriate probability that will allow 
you to evaluate the evidence. 

5. In sentence form, explain what the 
probability you have found means in the 
context of the question. Do not make a 
conclusion yet. Instead explain what it is a 
probability of. 

6. Based on the probability, determine 
whether Logo 2 is preferred significantly 
more than Logo 1. Explain your reasoning. 


Observational unit: People who are familiar 
with Striking Donkey Coffee; Variable: Whether 
they prefer Logo 2; Type of variable: 
Categorical. 


1. They need to assume the opposite of what 
they want to show. This means they need 
to assume that Logo 2 is NOT preferred 
significantly more than Logo 1. This would 
mean they are preferred equally. 
Therefore, there is a 50% chance that 
someone will choose Logo 2 (p = 0.50). 


2. * The data is collected randomly: Yes. It 
is a random sample of participants. 

* The outcomes are counted: Yes. They 
count how many people like Logo 2. 

* There are two possible outcomes: Yes. 
Either they prefer Logo 2 or they do 
not. 

* There are a fixed number of trials: 
Yes. They asked 40 people. 

* The trials are independent of each 
other: Yes. It is fair to assume that no 
participant’s preference is based on 
another participant’s preference. 


3. The more unlikely it is that we observed 
our evidence, the smaller the probability 
will be. This means, the smaller the 
probability, the more unlikely it is that the 
assumption (i.e. that there is no preference 
between the logos) is true. Since the 
marketers want clear evidence that there is 
a preference, they want a smaller 
probability, which would show it is 
unlikely that there is no preference between 
the logos. The level of significance is the 
threshold between likely and unlikely. 
Thus, if they want clear evidence, they 
want to set their threshold “high”, 
meaning they want to make it a small 
number. Since the level of significance is 
between 1% and 10%, the lowest level of 
significance (meaning the highest 
threshold of evidence) is 1%. 


4. P(X ≥ 26) = 0.04035 = 4.04% 

5. The probability that we observed at least 
26 out of 40 people who preferred Logo 2, 
assuming that there was no preference 
between the logos, is 4.04%. 

6. Since the probability is greater than 1% (it 
is 4.04%), it is not unlikely that we 
observed at least 26 out of 40 people who 
preferred Logo 2, assuming that there was 
no preference between the logos. 
Therefore, we do not reject that there was 
no preference between the logos. This 
suggests that Logo 2 is NOT preferred 
significantly more than Logo 1. 
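
The question asks for a computer program, so here is one possible sketch in Python. It assumes the binomial model with 40 trials and p = 0.5 (no preference) and the 1% level of significance argued for above; the names binom_sf and alpha are illustrative choices, not taken from the text. 

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

alpha = 0.01                   # level of significance (1%)
prob = binom_sf(26, 40, 0.5)   # assumption: no preference, p = 0.5
print(f"P(X >= 26) = {prob:.5f}")

if prob < alpha:
    print("unlikely under the assumption: reject it")
else:
    print("not unlikely under the assumption: do not reject it")
```

Note that the sum runs from 26 up to 40 because the evidence is "at least 26", not "exactly 26". 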


Introduction -- The Normal Distribution -- Mt Royal 
University -- Version 2016RevA 

If you ask enough people 
about their shoe size, you will find that your 
graphed data is shaped like a bell curve and can be 
described as normally distributed. (credit: Omer 
Unlv) 


Chapter Objective 
By the end of this chapter, the student should be 
able to: 


* Recognize the normal probability distribution 
and apply it appropriately. 

* Recognize the standard normal probability 
distribution and apply it appropriately. 

* Compare normal probabilities by converting 
to the standard normal distribution. 


The normal probability density function, a 
continuous distribution, is the most important of all 
the distributions. It is widely used and even more 
widely abused. Its graph is bell-shaped. You see the 
bell curve in almost all disciplines. Some of these 
include psychology, business, economics, the 
sciences, nursing, and, of course, mathematics. 
Some of your instructors may use the normal 
distribution to help determine your grade. Most IQ 
scores are normally distributed. Often real-estate 
prices fit a normal distribution. 


The normal distribution is extremely important, but 
it cannot be applied to everything in the real world. 
Remember here that we are still talking about the 
distribution of population data. This is a discussion 
of probability and thus it is the population data that 
may be normally distributed, and if it is, then this is 
how we can find probabilities of specific events just 
as we did for population data that may be 
binomially distributed or Poisson distributed. This 
caution is here because in the next chapter we will 
see that the normal distribution describes something 
very different from raw data and forms the 
foundation of inferential statistics. 


In this chapter, you will study the normal 
distribution, the standard normal distribution, and 
applications associated with them. 


The normal distribution has two parameters (two 
numerical descriptive measures), the mean (μ) and 
the standard deviation (σ). If X is a quantity to be 
measured that has a normal distribution with mean 
(μ) and standard deviation (σ), we designate this by 
writing: 


NORMAL: X ~ N(μ, σ) 


The normal probability density function is: 

f(x) = (1 / (σ√(2π))) e^(-(x - μ)² / (2σ²)) 


The curve is symmetrical about a vertical line drawn 
through the mean, μ. The mean is the same as the 
median, which is the same as the mode, because the 
graph is symmetric about μ. As the notation 
indicates, the normal distribution depends only on 
the mean and the standard deviation. Note that this 
is unlike several probability density functions we 
have already studied, such as the Poisson, where the 
mean is equal to μ and the standard deviation is 
simply the square root of the mean, or the binomial, 
where p is used to determine both the mean and 
standard deviation. Since the area under the curve 
must equal one, a change in the standard deviation, 
σ, causes a change in the shape of the curve; the 
curve becomes fatter and wider or skinnier and 
taller depending on σ. A change in μ causes the 
graph to shift to the left or right. This means there 
are an infinite number of normal probability 
distributions. One of special interest is called the 
standard normal distribution. 
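
The effect of each parameter can be checked numerically. The sketch below uses NormalDist from Python's standard statistics module; the three parameter choices are illustrative values, not taken from the text. 

```python
from statistics import NormalDist

narrow = NormalDist(mu=0, sigma=1)
wide = NormalDist(mu=0, sigma=2)      # same mean, larger standard deviation
shifted = NormalDist(mu=3, sigma=1)   # same standard deviation, larger mean

# Doubling sigma halves the height of the peak (the curve gets wider and flatter).
print(narrow.pdf(0), wide.pdf(0))

# Changing mu only slides the curve: the peak keeps its height but sits at x = 3.
print(shifted.pdf(3))
```

Because the total area must stay equal to one, a wider curve has to be lower, which is what the first line of output shows. 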


Formula Review 
X ~ N(μ, σ) 


μ = the mean; σ = the standard deviation 


Glossary 


Normal Distribution 
a continuous random variable (RV) with pdf 
f(x) = (1 / (σ√(2π))) e^(-(x - μ)² / (2σ²)), where μ is 
the mean of the distribution and σ is the 
standard deviation; notation: X ~ N(μ, σ). If 
μ = 0 and σ = 1, the RV, Z, is called the 
standard normal distribution. 


The Standard Normal Distribution-- The Normal 
Distribution --MRU - C Lemieux 


The standard normal distribution is a normal 
distribution of standardized values called 
z-scores. A z-score is measured in units of the 
standard deviation. For example, if the mean of a 
normal distribution is five and the standard 
deviation is two, the value x = 11 is three standard 
deviations above (or to the right of) the mean. The 
calculation is as follows: 


x = μ + (z)(σ) = 5 + (3)(2) = 11 
The z-score is three. 


The mean for the standard normal distribution is 
zero, and the standard deviation is one. What this 
does is dramatically simplify the mathematical 
calculation of probabilities. Take a moment and 
substitute zero and one in the appropriate places in 
the normal probability density function and you can 
see that the equation collapses into one that can be 
much more easily solved using integral calculus. The 
transformation z = (x - μ)/σ produces the 
distribution Z ~ N(0, 1). The 
value x comes from a known normal distribution 
with known mean μ and known standard deviation 
σ. The z-score tells how many standard deviations a 
particular x is away from the mean. 


Z-Scores 


If X is a normally distributed random variable and X 
~ N(μ, σ), then the z-score is: 
z = (x - μ)/σ 


The z-score tells you how many standard 
deviations the value x is above (to the right of) 
or below (to the left of) the mean, μ. Values of x 
that are larger than the mean have positive z-scores, 
and values of x that are smaller than the mean have 
negative z-scores. If x equals the mean, then x has a 
z-score of zero. 


Suppose X ~ N(5, 6). This says that X is a normally 
distributed random variable with mean μ = 5 and 
standard deviation σ = 6. Suppose x = 17. Then: 
z = (x - μ)/σ = (17 - 5)/6 = 2 

This means that x = 17 is two standard 
deviations (2σ) above or to the right of the mean μ 
= 5. The standard deviation is σ = 6. 

Now suppose x = 1. Then: z = (x - μ)/σ = (1 - 5)/6 = 
-0.67 (rounded to two decimal places) 

This means that x = 1 is 0.67 standard 
deviations (-0.67σ) below or to the left of the 
mean μ = 5. 
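
As a quick check of the two calculations above, the z-score formula can be written as a one-line Python function. The function name z_score is an illustrative choice. 

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

# X ~ N(5, 6): mean 5, standard deviation 6
print(z_score(17, 5, 6))            # two standard deviations above the mean
print(round(z_score(1, 5, 6), 2))   # below the mean, so the z-score is negative
```

The sign of the result carries the direction: positive values lie to the right of the mean and negative values to the left. 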


Some doctors believe that a person can lose five 
pounds, on average, in a month by reducing his or 
her fat intake and by exercising consistently. 
Suppose weight loss has a normal distribution. Let 
X = the amount of weight lost (in pounds) by a 
person in a month. Use a standard deviation of two 
pounds. X ~ N(5, 2). 


Suppose a person gained three pounds (a 
negative weight loss). Then z = ____. This 
z-score tells you that x = -3 is ____ standard 
deviations to the ____ (right or left) of the 
mean. 


z = (x - μ)/σ = (-3 - 5)/2 = -4 


z = -4. This z-score tells you that x = -3 is 
four standard deviations to the left of the 
mean. 


Suppose the random variables X and Y have the 
following normal distributions: X ~ N(5, 6) and Y ~ 
N(2, 1). If x = 17, then z = 2. (This was previously 
shown.) If y = 4, what is z? 


z = (y - μ)/σ = (4 - 2)/1 = 2, where μ = 2 and σ = 1. 


The z-score for y = 4 is z = 2. This means that four 
is z = 2 standard deviations to the right of the 
mean. Therefore, x = 17 and y = 4 are both two 
(of their own) standard deviations to the right of 
their respective means. 


The z-score allows us to compare data that are 
scaled differently. To understand the concept, 
suppose X ~ N(5, 6) represents weight gains for one 
group of people who are trying to gain weight in a 
six week period and Y ~ N(2, 1) measures the same 
weight gain for a second group of people. A 
negative weight gain would be a weight loss. Since x 
= 17 and y = 4 are each two standard deviations to 
the right of their means, they represent the same 
standardized weight gain relative to their means. 


Try It 


Fill in the blanks. 


Jerome averages 16 points a game with a 
standard deviation of four points. X ~ N(16, 4). 
Suppose Jerome scores ten points in a game. 


The z-score when x = 10 is -1.5. This score 
tells you that x = 10 is ____ standard 
deviations to the ____ (right or left) of the 
mean ____ (What is the mean?). 


1.5, left, 16 


The Empirical Rule 

If X is a random variable and has a normal 
distribution with mean μ and standard deviation σ, 
then the Empirical Rule says the following: 


* About 68.26% of the x values lie between -1σ 
and +1σ of the mean μ (within one standard 
deviation of the mean). 

* About 95.44% of the x values lie between -2σ 
and +2σ of the mean μ (within two standard 
deviations of the mean). 

* About 99.73% of the x values lie between -3σ 
and +3σ of the mean μ (within three standard 
deviations of the mean). Notice that almost all 
the x values lie within three standard 
deviations of the mean. 

* The z-scores for +1σ and -1σ are +1 and -1, 
respectively. 

* The z-scores for +2σ and -2σ are +2 and -2, 
respectively. 

* The z-scores for +3σ and -3σ are +3 and -3, 
respectively. 


The empirical rule is also known as the 68-95-99.7 
rule. 
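
The three percentages can be verified with the standard normal cumulative distribution. This is a minimal Python sketch using NormalDist from the standard library; the variable names are illustrative. 

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area between -k and +k standard deviations for k = 1, 2, 3
areas = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} standard deviation(s): {area:.4f}")
```

The computed areas are 0.6827, 0.9545 and 0.9973. The values 68.26% and 95.44% quoted above come from doubling four-decimal table entries, which is why they differ in the last digit. 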


The mean height of 15 to 18-year-old males from 
Chile from 2009 to 2010 was 170 cm with a 
standard deviation of 6.28 cm. Male heights are 
known to follow a normal distribution. Let X = the 
height of a 15 to 18-year-old male from Chile in 
2009 to 2010. Then X ~ N(170, 6.28). 


a. Suppose a 15 to 18-year-old male from Chile 
was 168 cm tall from 2009 to 2010. The 
z-score when x = 168 cm is z = ____. This 
z-score tells you that x = 168 is ____ standard 
deviations to the ____ (right or left) of the 
mean ____ (What is the mean?). 


z = (x - μ)/σ = (168 - 170)/6.28 = -0.32 


a. -0.32, 0.32, left, 170 


b. Suppose that the height of a 15 to 
18-year-old male from Chile from 2009 to 2010 
has a z-score of z = 1.27. What is the male’s 
height? The z-score (z = 1.27) tells you that the 
male’s height is ____ standard deviations to the 
____ (right or left) of the mean. 


z = (x - μ)/σ = (x - 170)/6.28 = 1.27, 
so x = 1.27 × 6.28 + 170 = 177.98 


b. 177.98, 1.27, right 
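
Parts a and b use the same formula in opposite directions: from x to z, and from z back to x. A small Python sketch of both directions follows, using the mean and standard deviation from this example; the function names are illustrative. 

```python
mu, sigma = 170, 6.28   # mean and standard deviation of the heights (cm)

def z_from_x(x):
    """Standardize a height: z = (x - mu) / sigma."""
    return (x - mu) / sigma

def x_from_z(z):
    """Undo the standardization: x = mu + z * sigma."""
    return mu + z * sigma

print(round(z_from_x(168), 2))    # part a
print(round(x_from_z(1.27), 2))   # part b
```

Applying one function after the other returns the starting value, which is a handy way to check a hand calculation. 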


Try It 


In 2012, 1,664,479 students took the SAT 
exam. The distribution of scores in the verbal 
section of the SAT had a mean μ = 496 and a 
standard deviation σ = 114. Let X = an SAT 
exam verbal section score in 2012. Then X ~ 
N(496, 114). 


Find the z-scores for x1 = 325 and x2 = 
366.21. Interpret each z-score. What can you 
say about x1 = 325 and x2 = 366.21? 


The z-score for x1 = 325 is z1 = -1.5. 
The z-score for x2 = 366.21 is z2 = -1.14. 
Student 2 scored closer to the mean than 
Student 1 and, since they both had negative 
z-scores, Student 2 had the better score. 


Suppose x has a normal distribution with mean 50 
and standard deviation 6. 


* About 68% of the x values lie between -1σ = 
(-1)(6) = -6 and 1σ = (1)(6) = 6 of the 
mean 50. The values 50 - 6 = 44 and 50 + 6 
= 56 are within one standard deviation of the 
mean 50. The z-scores are -1 and +1 for 44 
and 56, respectively. 

* About 95% of the x values lie between -2σ = 
(-2)(6) = -12 and 2σ = (2)(6) = 12 of the 
mean 50. The values 50 - 12 = 38 and 50 + 12 
= 62 are within two standard deviations of the 
mean 50. The z-scores are -2 and +2 for 38 and 
62, respectively. 

* About 99.7% of the x values lie between -3σ 
= (-3)(6) = -18 and 3σ = (3)(6) = 18 of the 
mean 50. The values 50 - 18 = 32 and 50 + 
18 = 68 are within three standard deviations 
of the mean 50. The z-scores are -3 and +3 
for 32 and 68, respectively. 


Try It 


Suppose X has a normal distribution with 
mean 25 and standard deviation five. Between 
what values of x do 68% of the values lie? 


between 20 and 30. 


Try It 


The scores on a college entrance exam have an 
approximate normal distribution with mean μ 
= 52 points and a standard deviation σ = 11 
points. 


1. About 68% of the y values lie between 
what two values? These values are 
________. The z-scores are 
________, respectively. 
2. About 95% of the y values lie between 
what two values? These values are 
________. The z-scores are 
________, respectively. 
3. About 99.7% of the y values lie between 
what two values? These values are 
________. The z-scores are 
________, respectively. 


1. About 68% of the values lie between the 
values 41 and 63. The z-scores are -1 and 
1, respectively. 

2. About 95% of the values lie between the 
values 30 and 74. The z-scores are -2 and 
2, respectively. 

3. About 99.7% of the values lie between 
the values 19 and 85. The z-scores are -3 
and 3, respectively. 
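
The interval endpoints in these Empirical Rule exercises all come from the same arithmetic, mean plus or minus k standard deviations. A short Python sketch, with the illustrative helper name empirical_intervals: 

```python
def empirical_intervals(mu, sigma):
    """Return {k: (low, high)} for k = 1, 2, 3 standard deviations around mu."""
    return {k: (mu - k * sigma, mu + k * sigma) for k in (1, 2, 3)}

# College entrance exam: mean 52 points, standard deviation 11 points
for k, (low, high) in empirical_intervals(52, 11).items():
    print(f"z = -{k} to +{k}: between {low} and {high}")
```

Calling empirical_intervals(50, 6) reproduces the endpoints 44 and 56, 38 and 62, and 32 and 68 from the earlier example. 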
Chapter Review 


A z-score is a standardized value. Its distribution is 
the standard normal, Z ~ N(0, 1). The mean of the 
z-scores is zero and the standard deviation is one. If 
z is the z-score for a value x from the normal 
distribution N(μ, σ) then z tells you how many 
standard deviations x is above (greater than) or 
below (less than) μ. 


Practice Questions 


In a normal distribution, x = 3 and z = 0.67. 
This tells you that x = 3 is ____ standard 
deviations to the ____ (right or left) of the mean. 


0.67, right 


In a normal distribution, x = -5 and z = -3.14. 
This tells you that x = -5 is ____ standard 
deviations to the ____ (right or left) of the mean. 


3.14, left 


About what percent of x values from a normal 
distribution lie within one standard deviation 
(left and right) of the mean of that distribution? 


about 68% 


About what percent of the x values from a 
normal distribution lie within two standard 
deviations (left and right) of the mean of that 
distribution? 


about 95.44% 


About what percent of x values lie between the 
second and third standard deviations (both 
sides)? 


about 4% 


Use the following information to answer the next two 
multiple choice exercises: The patient recovery time 
from a particular surgical procedure is normally 
distributed with a mean of 5.3 days and a standard 
deviation of 2.1 days. 


What is the median recovery time? 


1. 2.7 
2. 5.3 
3. 7.4 
4. 2.1 


What is the z-score for a patient who takes ten 
days to recover? 


1. 1.5 
2. 0.2 
3. 2.2 
4. 7.3 


Wesley Crusher was tasked with exploring the 
Selcundi Drema sector. He found a new 
species of tribbles. In his final report, he stated, 
“Though tribbles vary in size and dimension, 
the middle 99.73% of them weigh between 4 
and 7.2 kg and follow a normal distribution.” 
Based on this, what is the mean and standard 
deviation for the weight of tribbles? Choose the 
best answer. 


1. mean = 5.6 kg, standard deviation = 1.07 
kg 

2. mean = 5.6 kg, standard deviation = 0.53 
kg 

3. mean = 5.6 kg, standard deviation = 0.8 
kg 


4. mean = 99.73 kg, standard deviation = 
3.2 kg 

5. There is not enough information to 
determine this. 


Glossary 


Standard Normal Distribution 
a continuous random variable (RV) X ~ N(0, 
1); when X follows the standard normal 
distribution, it is often noted as Z ~ N(0, 1). 


z-score 
the linear transformation of the form z = (x - μ)/σ, 
or written as z = |x - μ|/σ; if this 
transformation is applied to any normal 
distribution X ~ N(μ, σ) the result is the 
standard normal distribution Z ~ N(0, 1). If 
this transformation is applied to any specific 
value x of the RV with mean μ and standard 
deviation σ, the result is called the z-score of 
x. The z-score allows us to compare data that 
are normally distributed but scaled 
differently. A z-score is the number of 
standard deviations a particular x is away 
from its mean value. 


Using the Normal Distribution-- The Normal 
Distribution -- MRU - C Lemieux 


The shaded area in the following graph indicates the 
area to the right of x. This area is represented by the 
probability P(X > x). Normal tables, computers, and 
calculators provide or calculate the probability P(X 
> x). 


Shaded area 
represents probability 
P(X > x) 


The area to the right is then P(X > x) = 1 - P(X < 
x). Remember, P(X < x) = Area to the left of the 
vertical line through x. P(X > x) = 1 - P(X < x) = 
Area to the right of the vertical line through x. P(X 
< x) is the same as P(X ≤ x) and P(X > x) is the 
same as P(X ≥ x) for continuous distributions. 


Calculations of Probabilities 


To find the probability for probability curves with a 
continuous random variable we need to calculate 
the area under the curve across the values of X we 
are interested in. For the normal distribution this 
seems a difficult task given the complexity of the 
formula. There is, however, a simple way to get 
what we want. 


We start knowing that the area under a probability 
curve is the probability. 


Shaded area between x1 and x2 (on either side of μ): 
P(x1 ≤ X ≤ x2) 


This shows that the area between x1 and x2 is the 
probability as stated in the formula: P(x1 ≤ X ≤ 
x2) 


The mathematical tool needed to find the area 
under a curve is integral calculus. The integral of 
the normal probability density function between the 
two points x1 and x2 is the area under the curve 
between these two points and is the probability 
between these two points. 
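
To make the link between area and probability concrete, the sketch below approximates the integral numerically with a midpoint rule and compares it with the cumulative distribution. It is only an illustration: the function name area_under_pdf, the number of steps, and the parameters (mean 10, standard deviation 5) are assumptions chosen for the example. 

```python
from statistics import NormalDist

def area_under_pdf(dist, x1, x2, steps=10_000):
    """Midpoint-rule approximation of the integral of dist.pdf from x1 to x2."""
    width = (x2 - x1) / steps
    return sum(dist.pdf(x1 + (i + 0.5) * width) for i in range(steps)) * width

X = NormalDist(mu=10, sigma=5)        # illustrative parameters
numeric = area_under_pdf(X, 10, 15)   # area from the mean to one sd above it
exact = X.cdf(15) - X.cdf(10)         # the same area from the cumulative distribution
print(round(numeric, 4), round(exact, 4))
```

Both numbers agree to four decimal places, which is the sense in which the area under the curve "is" the probability. 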


Doing these integrals is no fun and can be very time 
consuming. But now, remembering that there are an 
infinite number of normal distributions out there, 
we can consider the one with a mean of zero and a 
standard deviation of 1. This particular normal 
distribution is given the name Standard Normal 
Distribution. Putting these values into the formula it 
reduces to a very simple equation. We can now 
quite easily calculate all probabilities for any value 
of x, for this particular normal distribution, that has 
a mean of zero and a standard deviation of 1. These 
have been produced and are available here in the 
text or everywhere on the web. They are presented 
in various ways. The table in this text is the most 
common presentation and is set up with 
probabilities for one-half the distribution beginning 
with zero, the mean, and moving outward. The 
shaded area in the graph at the top of the table 
represents the probability from zero to the specific Z 
value noted on the horizontal axis, Z. 


The only problem is that even with this table, it 
would be a ridiculous coincidence that our data had 
a mean of zero and a standard deviation of one. The 
solution is to convert the distribution we have with 
its mean and standard deviation to this new 
Standard Normal Distribution. The Standard Normal 
has a random variable called Z. 


Using the standard normal table, typically called the 
normal table, to find the probability of one standard 
deviation, go to the Z column, reading down to 1.0 
and then read at column 0. That number, 0.3413, is 
the probability from zero to 1 standard deviation. At 
the top of the table is the shaded area in the 
distribution which is the probability for one 
standard deviation. The table has solved our integral 
calculus problem. But only if our data has a mean of 
zero and a standard deviation of 1. 


However, the essential point here is, the probability 
for one standard deviation on one normal 
distribution is the same on every normal 
distribution. If the population data set has a mean of 
10 and a standard deviation of 5 then the 
probability from 10 to 15, one standard deviation, is 
the same as from zero to 1, one standard deviation 
on the standard normal distribution. To compute 
probabilities, areas, for any normal distribution, we 
need only to convert the particular normal 
distribution to the standard normal distribution and 
look up the answer in the tables. As review, here 
again is the standardizing formula: 

Z = (X - μ)/σ 


where Z is the value on the standard normal 
distribution, X is the value from a normal 
distribution one wishes to convert to the standard 
normal, and μ and σ are, respectively, the mean and 
standard deviation of that population. Note that the 
equation uses μ and σ, which denote population 
parameters. This is still dealing with probability, so 
we are always dealing with the population, with 
known parameter values and a known distribution. 
It is also important to note that because the normal 
distribution is symmetrical, the area between the 
mean and a z-score is the same whether the 
z-score is positive or negative. One standard 
deviation to the left (negative Z-score) covers the 
same area as one standard deviation to the right 
(positive Z-score). This fact is why the Standard 
Normal tables do not provide areas for the left side 
of the distribution. Because of this symmetry, the 
Z-score formula is sometimes written as: 

Z = |X - μ|/σ 


where the vertical lines in the equation mean the 
absolute value of the number. 


What the standardizing formula is really doing is 
computing the number of standard deviations X is 
from the mean of its own distribution. The 
standardizing formula and the concept of counting 
standard deviations from the mean is the secret of 
all that we will do in this statistics class. The reason 
this is true is that all of statistics boils down to 
variation, and the counting of standard deviations is 
a measure of variation. 


This formula, in many disguises, will reappear over 
and over throughout this course. 
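
The claim that one standard deviation covers the same probability on every normal distribution can be checked directly. The sketch below uses the population with mean 10 and standard deviation 5 mentioned earlier; the variable names are illustrative. 

```python
from statistics import NormalDist

X = NormalDist(mu=10, sigma=5)   # population with mean 10, standard deviation 5
Z = NormalDist()                 # standard normal, N(0, 1)

z = (15 - X.mean) / X.stdev              # standardize x = 15
p_original = X.cdf(15) - X.cdf(10)       # P(10 < X < 15)
p_standard = Z.cdf(z) - Z.cdf(0)         # P(0 < Z < 1)
print(z, round(p_original, 4), round(p_standard, 4))
```

Both probabilities round to 0.3413, the table entry for z = 1.0, so converting to Z changes the scale of the horizontal axis but not the area. 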


The final exam scores in a statistics class were 
normally distributed with a mean of 63 and a 
standard deviation of five. 


a. Find the probability that a randomly 
selected student scored more than 65 on the 


exam. 
b. Find the probability that a randomly 
selected student scored less than 85. 


a. Let X = a score on the final exam. X ~ 
N(63, 5), where μ = 63 and σ = 5. 


Draw a graph. 
Then, find P(x > 65). 


P(x > 65) = 0.3446 


z1 = (x1 - μ)/σ = (65 - 63)/5 = 0.4 
P(x ≥ x1) = P(Z ≥ z1) = 0.3446 


The probability that any student selected at 
random scores more than 65 is 0.3446. Here is 
how we found this answer. 


The normal table provides probabilities from 
zero to the value z1. For this problem the 
question can be written as: P(X ≥ 65) = P(Z 
≥ z1), which is the area in the tail. To find 
this area the formula would be 0.5 - P(63 ≤ X ≤ 
65). One half of the probability is above the 
mean value because this is a symmetrical 
distribution. The graph shows how to find the 
area in the tail by subtracting from 0.5 the 
portion between the mean, zero, and the z1 
value. The final answer is: P(X ≥ 65) = P(Z ≥ 
0.4) = 0.3446 


z = (65 - 63)/5 = 0.4 


Area from the mean of zero to z1 is 
0.1554 


P(x > 65) = P(z > 0.4) = 0.5 - 0.1554 = 
0.3446 


b. 


z = (x - μ)/σ = (85 - 63)/5 = 4.4, which is larger 
than the maximum value on the Standard Normal 
Table. A score of 85 is 4.4 standard deviations 
above the mean of 63, which is beyond the range 
of the standard normal table. Therefore, the 
probability that one student scores less than 85 
is approximately one (or 100%). 
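
Both parts of this example can be checked in Python. The sketch follows the table method for part a (area from the mean to z1, then subtract from 0.5) and evaluates the cumulative distribution directly for part b; the variable names are illustrative. 

```python
from statistics import NormalDist

exam = NormalDist(mu=63, sigma=5)
Z = NormalDist()

# Part a: tail area above 65, computed the way the table is used.
z1 = (65 - 63) / 5
mean_to_z1 = Z.cdf(z1) - 0.5   # table entry for z = 0.4
tail = 0.5 - mean_to_z1
print(round(mean_to_z1, 4), round(tail, 4))

# Part b: area below 85, which lies beyond the range of a printed table.
print(exam.cdf(85))
```

For part b the result is extremely close to one but not exactly one, which is why the text says "approximately". 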


Try It 


The golf scores for a school team were 
normally distributed with a mean of 68 and a 
standard deviation of three. 


Find the probability that a randomly selected 
golfer scored less than 65. 


normalcdf(-10^99, 65, 68, 3) = 0.1587 


A personal computer is used for office work at 
home, research, communication, personal finances, 
education, entertainment, social networking, and a 
myriad of other things. Suppose that the average 
number of hours a household personal computer is 
used for entertainment is two hours per day. 
Assume the times for entertainment are normally 
distributed and the standard deviation for the times 
is half an hour. 


a. Find the probability that a household 
personal computer is used for entertainment 
between 1.8 and 2.75 hours per day. 


a. Let X = the amount of time (in hours) a 
household personal computer is used for 
entertainment. X ~ N(2, 0.5) where μ = 2 and 
σ = 0.5. 
Find P(1.8 < x < 2.75). 
The probability for which you are looking is 
the area between x = 1.8 and x = 2.75. P(1.8 
< x < 2.75) = 0.5886 


P(1.8 ≤ x ≤ 2.75) = P(z1 ≤ Z ≤ z2), where 
z1 = (1.8 - 2)/0.5 = -0.4 and 
z2 = (2.75 - 2)/0.5 = 1.5. 
The probability that a household personal 
computer is used between 1.8 and 2.75 hours 
per day for entertainment is 0.5886. 


b. Find the maximum number of hours per day 
that the bottom quartile of households uses a 
personal computer for entertainment. 


b. To find the maximum number of hours per 
day that the bottom quartile of households 
uses a personal computer for entertainment, 
find the 25th percentile, k, where P(x < k) 
= 0.25. 


k = 1.66 


Shaded area represents probability P(x < k) = 0.25. 
Unshaded area represents probability P(x > k) = 0.75. 


f(Z) = 0.5 - 0.25 = 0.25, therefore z = -0.675 (or 
just -0.67 using the table). 
z = (x - μ)/σ = (x - 2)/0.5 = -0.675, therefore 
x = -0.675 × 0.5 + 2 = 1.66 hours. 


The maximum number of hours per day that 
the bottom quartile of households uses a 
personal computer for entertainment is 1.66 
hours. 
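
Finding a percentile is the reverse of finding a probability: the area is known and the value is wanted. In Python this is the inverse cumulative distribution, shown here as a sketch with illustrative variable names. 

```python
from statistics import NormalDist

hours = NormalDist(mu=2, sigma=0.5)

k = hours.inv_cdf(0.25)               # value with 25% of the area to its left
z = (k - hours.mean) / hours.stdev    # the corresponding z-score
print(round(z, 3), round(k, 2))
```

The computed z-score is about -0.674, slightly different from the -0.675 read by interpolating in the table, but k still rounds to 1.66 hours. 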


Try It 


The golf scores for a school team were 
normally distributed with a mean of 68 and a 
standard deviation of three. Find the 
probability that a golfer scored between 66 
and 70. 


normalcdf(66,70,68,3) = 0.4950 
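
The normalcdf calls in these Try It answers are calculator functions. An equivalent can be sketched in Python; the helper below borrows the calculator's name and argument order purely for illustration, and it uses -1e99 as a stand-in for negative infinity in the same way the calculator uses -10^99. 

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu, sigma):
    """P(lower < X < upper) for X ~ N(mu, sigma)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

between = normalcdf(66, 70, 68, 3)     # golfer scores between 66 and 70
below = normalcdf(-1e99, 65, 68, 3)    # golfer scores less than 65
print(round(between, 4), round(below, 4))
```

A very large negative lower bound works because the area to its left is effectively zero. 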


There are approximately one billion smartphone 
users in the world today. In the United States the 
ages 13 to 55+ of smartphone users approximately 
follow a normal distribution with approximate 
mean and standard deviation of 36.9 years and 
13.9 years, respectively. 


a. Determine the probability that a random 
smartphone user in the age range 13 to 55+ is 
between 23 and 64.7 years old. 


a. 0.8186 


b. Determine the probability that a randomly 
selected smartphone user in the age range 13 
to 55+ is at most 50.8 years old. 


b. 0.8413 


A citrus farmer who grows mandarin oranges finds 
that the diameters of mandarin oranges harvested 
on his farm follow a normal distribution with a 
mean diameter of 5.85 cm and a standard 
deviation of 0.24 cm. 


a. Find the probability that a randomly 
selected mandarin orange from this farm has a 
diameter larger than 6.0 cm. Sketch the graph. 


z1 = (x - μ)/σ = (6 - 5.85)/0.24 = 0.625 


P(x ≥ 6) = P(z ≥ 0.625) = 0.2660 


b. The middle 20% of mandarin oranges from 
this farm have diameters between ____ and 
____. 


f(Z) = 0.20/2 = 0.10, therefore z = ±0.25 
z = (x - μ)/σ = (x - 5.85)/0.24 = ±0.25, so 
x = ±0.25 × 0.24 + 5.85, giving (5.79, 5.91) 
Practice questions 


A local bank has determined that the daily 
balances of the chequing accounts of their 
customers are normally distributed with a mean 
of $280 and a standard deviation of $20. 


1. What percentage of their customers has 
daily balances less than $290? 

2. What percentage of their customers has 
daily balances between $250 and $275? 

3. What percentage of their customers has 
daily balances over $260? 

4. The Bank is planning a special promotion 
where it is rewarding its customers whose 
balances are in the top 15% with a free 
toaster. What account balance must a 
customer achieve in order to qualify for a 
free toaster? 

5. 68.26% of balances will be between what 
amount? 

6. What is the interquartile range for the 
account balances? 


1. 0.6915 

2. 0.3345 

3. 0.8413 

4. $300.70 

5. $260 to $300 

6. IQR = 293.5 - 266.5 = $27 
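
The cutoff and interquartile range answers for the bank can be checked with the inverse cumulative distribution. This is a sketch under the stated model (mean $280, standard deviation $20); the variable names are illustrative. 

```python
from statistics import NormalDist

balance = NormalDist(mu=280, sigma=20)

top_15_cutoff = balance.inv_cdf(0.85)   # 85% of balances fall below this value
q1 = balance.inv_cdf(0.25)              # first quartile
q3 = balance.inv_cdf(0.75)              # third quartile
print(round(top_15_cutoff, 2), round(q3 - q1, 2))
```

The computed values, about $300.73 and $26.98, differ by a few cents from the table-based answers because the table z-values are rounded. 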


The Old Baldy Tire Company is introducing a 
new steel belted radial tire. Old Baldy's 
engineering department has estimated that the 
average life of this tire will be 50,000 km and 
that the standard deviation of these tires will be 
5000 km. It is assumed that the useful life of 
these tires follows a normal distribution. 


1. What is the probability that: 


1. These tires will last for longer than 
60,000 km? 

2. These tires will last for less than 
38,000 km? 

3. These tires will last for between 
45,000 and 58,000 km? 

4. These tires will last for between 
39,000 and 43,000 km? 


2. The Old Baldy Tire Company is 
considering offering a tire guarantee that 
each new set of tires will last a certain 
number of kilometers. If the tires fail to 
last the specified number of kilometers a 
new set of tires will be provided to the 
purchaser for free. The Old Baldy Tire 
Company wants to ensure that no more 
than 10% of the tires produced qualify for 
this guarantee. For how many kilometers 
should these tires be guaranteed to last? 

3. 35% of tires will last less than how many 
kilometers? 


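1. 1. 0.0228
   2. 0.0082
   3. 0.7865
   4. 0.0669

2. 43,592.2 km (the 10th percentile of tire life, so that no more than 10% of tires fail before the guaranteed distance)
3. 48,073.4 km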


Introduction - Sampling distributions - MRU - C 
Lemieux 
Introduction to section on sampling distributions 


Chapter Objective 
By the end of the chapter, the student should be 
able to: 


* Distinguish between a sample, a population
and a sampling distribution.

* Know the characteristics of the sampling
distribution.

* Recognize sampling distribution problems.

* Apply and interpret the central limit theorem
for both means and proportions.


Introduction to Sampling Distributions 
Introduces the concept of sampling distributions. 


When we take a random sample from a population, 
we expect that there is going to be some variability 
(i.e. sampling variability) between the information 
the sample gives us and the whole population. That 
is, we might find that the sample mean and the 
population mean are different. We may also find 
that if we take multiple random samples of size n 
that the sample mean for each sample is different. 
The following chapter looks at how we can better 
understand the sampling variability in statistics. 


Before we go on, here is a reminder of a few terms 
and symbols. 


A parameter is a descriptive measure of the
population (e.g. population mean, population
standard deviation, population proportion).


A statistic is a descriptive measure of the sample
(e.g. sample mean, sample standard deviation,
sample proportion).

Table of important symbols


The population mean, population standard 
deviation, and sample standard deviation have a 
subscript of x to demonstrate that they are the 
measure for the variable X. Though this is mostly 
notational, it does become important later in this 
chapter. 

Figure caption: Number of women in each sample of size 100

Footnote: This is based on calculations of how long it would
take a network of supercomputers in 2011 to work
through all possible combinations of a 256-bit
encryption. By the way, there are fewer possibilities in
a 256-bit encryption than there are possible
samples of size 100 from a population of 12,000.


What is a sampling distribution? 


Suppose we take many different random samples of 
100 university students from a university that has 
an equal number of men and women. 


The number of women will vary amongst the 
samples. For example, one sample could have 45 
women, another sample could have 48 women, 


another sample could have 52 women, etc. 


Though it could be possible that we get a random 
sample that only has 2 women in it, it would be 
pretty unlikely. Instead, we would expect that most 
of the samples would have around 50 women in it 
with some variation around that value. 


Figure 1 is the result of a simulation that took
10,000 samples of size 100 from a population that
had an equal number of women and men. The
horizontal axis is the number of women in each
sample. The height of each bar is the number of
samples that had that many women.


[Figure 1: histogram of the number of women in each sample; the horizontal axis runs from 30 to 70.]


Notice how the most common number of women is 
around 50 (i.e. the average), but there is variation 
from that 50. Most samples have between 40 and 60 
women. 
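
A simulation like the one behind Figure 1 can be sketched in a few lines of Python. This is only an illustration, not the code used to produce the figure: it assumes the population is large enough that each sampled student is a woman with probability 0.5 independently of the others, and the function name and seed are made up for the example.

```python
import random
from collections import Counter

def simulate_women_counts(num_samples=10_000, sample_size=100, p_woman=0.5, seed=1):
    """Return the number of women in each simulated random sample."""
    rng = random.Random(seed)
    counts = []
    for _ in range(num_samples):
        # Assumption: each sampled student is a woman with probability p_woman.
        women = sum(1 for _ in range(sample_size) if rng.random() < p_woman)
        counts.append(women)
    return counts

counts = simulate_women_counts()
histogram = Counter(counts)  # maps number of women -> number of samples
most_common_count = histogram.most_common(1)[0][0]
share_40_to_60 = sum(1 for c in counts if 40 <= c <= 60) / len(counts)

print("most common number of women:", most_common_count)
print("share of samples with 40 to 60 women:", share_40_to_60)
```

Plotting the contents of histogram as a bar chart would give a picture with the same general shape as Figure 1.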


The variability among random samples of size n 
from the same population is called sampling 
variability. 


A probability distribution that characterizes some
aspect of sampling variability is termed a sampling
distribution. A sampling distribution is constructed
by taking all possible samples of size n from a
population. Then for each sample, a statistic is
calculated (e.g. sample mean, sample proportion,
sample standard deviation). The sampling
distribution is then created by making a graph of all
of these statistics.


Actually constructing a sampling distribution is
often very difficult. A medium sized university in
Canada might have 12,000 students. All possible
samples of size 100 from that population would
result in $5.87 \times 10^{249}$ unique samples! Think about
that. One billion is $10^9$. Google is named after a
googol ($10^{100}$) because they wanted Google to be
associated with an immense amount of data. Yet a
googol is smaller than the number of possible samples of size 100
from the medium sized university. If we got a
computer to find all possible samples, it would take
it over a billion years to find them [footnote]!
Therefore, actually constructing a true sampling
distribution in most situations is incredibly hard,
incredibly time consuming, and not really worth it.
Thus when we talk about sampling distributions, we
talk about a theoretical sampling distribution.


That is, we theorize what this sampling distribution 
would look like if it was possible to examine all 
possible samples. 
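
The count of possible samples quoted above can be checked directly. The sketch below assumes that a "sample" means an unordered selection of 100 distinct students, so the count is the binomial coefficient "12,000 choose 100"; the variable names are only illustrative.

```python
import math

population_size = 12_000
sample_size = 100

# Number of distinct samples of size 100, ignoring order.
num_samples = math.comb(population_size, sample_size)
googol = 10 ** 100

print("digits in the number of possible samples:", len(str(num_samples)))
print("larger than a googol:", num_samples > googol)
```

A number with 250 digits is of the order $10^{249}$, which agrees with the figure given in the text.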


Due to these limitations, we often look at an
empirical sampling distribution, instead of a
theoretical sampling distribution. An empirical
sampling distribution is created by taking many
samples from a population and finding a statistic for
each sample, but not doing this for all possible
samples. The plot shown in Figure 1 is an example of an
empirical sampling distribution as it only contains
10,000 samples and not all possible samples. The
statistic in Figure 1 is the number of women, but we could
have also looked at the proportion of women.


In summary, a sampling distribution is a distribution 
of a statistic. This differs from other distributions, 
like the population distribution, which are 
distributions for individual data values. 


Why do we care about sampling 
distributions? 


Suppose we take a random sample of 100 students 
from a medium sized university and we find that 75 
of them are women. Does this call into question the 
assumption that 50% of the students are women? 
This is hard to figure out unless we know how likely 
it is that we could have found this random sample, 


assuming that there are an equal number of men 
and women. 


The sampling distribution helps us find this
probability. From the empirical sampling
distribution in Figure 1, the probability
of getting a random sample with 75 women, assuming
that there are an equal number of men and women,
is 0.0000%. That is, it is really unlikely to get a
random sample of 75 women out of 100 if there are
an equal number of men and women in the
population. Based on this, we can be fairly confident
that this university probably doesn’t have an equal
number of men and women. Instead, it is more
likely that there are more women than men at this
university.
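
As a cross-check on the empirical value, the probability can also be computed from the binomial distribution. The sketch below makes two assumptions that are not stated in the text: that the number of women in a sample follows a binomial distribution with n = 100 and probability 0.5, and that the event of interest is "75 or more women". The function name is hypothetical.

```python
import math

def binomial_upper_tail(n, p, k):
    """Exact P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

prob = binomial_upper_tail(100, 0.5, 75)
print(f"P(at least 75 women out of 100) = {prob:.2e}")
```

The result is far smaller than 0.0001%, which is why it shows up as 0.0000% when rounded to four decimal places.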


The process described above is called inferential 
statistics. Inferential statistics is used to make a 
conclusion about the population (all students at the 
university) from a sample (100 students). In general, 
to do any form of inferential statistics, we need to 
use a sampling distribution to either determine how 
likely or unlikely a statistic is (in hypothesis testing) 
or to estimate a parameter from a statistic 
(confidence intervals). 


Thus sampling distributions are the backbone of 
inferential statistics. 


Note: What was described above about the
proportion of women at a university should sound
familiar. In Chapter 4, we used the binomial
distribution to determine how likely or
unlikely events were. The binomial distribution was
helping us understand the sampling distribution of
proportions.


Constructing empirical sampling distributions - MRU 
- C Lemieux 

Introduction to how to construct a sampling 
distribution. 


How to construct an empirical sampling
distribution


If we have access to the population, we can 
construct an empirical distribution from it. This can 
be done by using computer software to pull random 
samples from a population. An example of one such 
tool is from the Rossman Chance website, which has 
an applet that allows you to create an empirical 
sampling distribution from a finite population: 
http://www.rossmanchance.com/applets/OneSample53.html


When constructing an empirical sampling 
distribution, it is important to keep the law of large 
numbers in mind. That is, the more samples you 
take, the closer the empirical sampling distribution 
will be to the theoretical sampling distribution. In 
general, empirical sampling distributions should be 
constructed from at least 10,000 samples. 


To get an idea of how an empirical sampling
distribution is constructed, go to
http://onlinestatbook.com/stat_sim/sampling_dist/index.html


The images/figures in this example were generated 
from David Lane's sampling distribution applet that 
is part of the OnlineStatBook project [footnote]. 
Online Statistics Education: A Multimedia Course 
of Study (http://onlinestatbook.com/). Project 
Leader: David M. Lane, Rice University. 

Figure 1 shows the histogram of the population we
are going to generate an empirical sampling
distribution from. We call this population the
parent population as it is the population we are
creating the sampling distribution from. Notice that
the parent population is skewed right.


[Figure: Parent population. The horizontal axis runs from 0 to 32.]

We are going to take multiple samples of size 10
from the parent population and look at the statistic
of the sample mean for each sample.

Here is the first sample:

[Figure: Sample of size 10 from the parent population]

This is the sample mean of the sample:

[Figure: Sample mean for one sample of size 10. Distribution of Means, N=10.]

Now one sample mean is not enough to tell us
what the sampling distribution looks like. So let’s
take a few more samples. Let’s take 5 more samples
of size 10 and plot their sample means:

[Figure: Six sample means from parent population. Distribution of Means, N=10.]

This is still a pretty small sample.


There are two sample sizes here. One is the size of
the sample we are taking from the parent
population (10). The other is the number of
samples we’ve taken (6). The first is the sample
size for the sample. The second is the sample size
for the empirical sampling distribution.


Now let’s take 10,000 samples of size 10 from the 
population and plot each of their sample means. 
This is what we get: 


[Figure: 10,000 sample means from parent population. Distribution of Means, N=10.]

Finally, let’s take 100,000 samples of size 10 from
the population and plot each of their sample
means. This is what we get:

[Figure: 100,000 sample means from parent population. Distribution of Means, N=10.]


Notice how there is no real difference between the
distributions (shape, centre and variation) in Figure
5 and Figure 6. This means that our empirical
distribution is now giving us a good sense of what
the theoretical sampling distribution would look
like. When this happens, this is called
convergence. That is, the empirical sampling
distribution is converging on the theoretical
sampling distribution. As the sample size of the
empirical sampling distribution increases, this is
expected to happen due to the law of large numbers.
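
The convergence described above can be reproduced with a short simulation. The sketch below is a stand-in for the applet, not a copy of it: the right-skewed parent population on the values 0 to 32 is invented for illustration (its weights are an assumption), and the function name is hypothetical.

```python
import itertools
import random
import statistics

# Hypothetical right-skewed parent population on the values 0..32.
# The weights are an assumption; they are not the applet's population.
values = list(range(33))
weights = [0.85 ** v for v in values]
cum_weights = list(itertools.accumulate(weights))
population_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)

def empirical_sampling_distribution(num_samples, sample_size=10, seed=0):
    """Sample means from num_samples random samples of the given size."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(values, cum_weights=cum_weights, k=sample_size))
        for _ in range(num_samples)
    ]

means_10k = empirical_sampling_distribution(10_000, seed=1)
means_100k = empirical_sampling_distribution(100_000, seed=2)

for label, means in (("10,000", means_10k), ("100,000", means_100k)):
    print(label, "samples: centre", round(statistics.fmean(means), 2),
          "spread", round(statistics.stdev(means), 2))
print("population mean:", round(population_mean, 2))
```

The centre and spread printed for 10,000 and 100,000 samples should be nearly identical, which is what convergence looks like numerically.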


Bootstrapping 


Suppose we don’t have access to the population. 
This can happen if the population is infinite (e.g. in 
a manufacturing process) or where the population is 
large (e.g. population of Canada) or where most 
researchers wouldn’t have access to the population 
(e.g. list of students at a university). Can we still 
construct an empirical sampling distribution? 


The answer is yes! To do this, we use a process 
called bootstrapping. Essentially bootstrapping 
follows the same procedure as outlined in Example 
1, but instead of using a parent population, we use a 
parent sample. That is, we take a good sample from 
the population and use that to construct the 
sampling distribution. 


Again the law of large numbers applies. If the 
random sample from the population is large enough, 
then the sample will most likely be a good estimate 
of the population. Then the empirical sampling 
distribution generated by the sample will most 
likely be a good estimate of the theoretical sampling 


distribution of the population. 


Bootstrapping only works if the sample being used
has been collected properly: the sampling
technique must ensure that the sample is random, that the
sample is representative of the population, and that the
sample size is large enough. There are no set rules
on how big the sample needs to be, but for
bootstrapping the bigger the better.
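
A minimal bootstrapping sketch is shown below. It assumes the usual form of the procedure, in which each resample is drawn with replacement from the parent sample and has the same size as the parent sample; the text above does not spell out these details. The parent sample here is made-up data and the function name is hypothetical.

```python
import random
import statistics

def bootstrap_means(parent_sample, num_resamples=10_000, seed=0):
    """Sample means of resamples drawn with replacement from parent_sample."""
    rng = random.Random(seed)
    n = len(parent_sample)
    return [
        statistics.fmean(rng.choices(parent_sample, k=n))
        for _ in range(num_resamples)
    ]

# Hypothetical parent sample of 50 measurements (made-up data).
data_rng = random.Random(42)
parent_sample = [data_rng.gauss(50, 10) for _ in range(50)]

boot = bootstrap_means(parent_sample)
print("parent sample mean:", round(statistics.fmean(parent_sample), 2))
print("centre of bootstrap distribution:", round(statistics.fmean(boot), 2))
print("bootstrap estimate of the standard error:", round(statistics.stdev(boot), 2))
```

The bootstrap distribution is centred on the parent sample mean, so it is only as good an estimate of the theoretical sampling distribution as the parent sample is of the population.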


Central Limit Theorem - MRU - C Lemieux 
Introduction to the key properties of the central 
limit theorem for the mean and proportions. 


Another way to determine what the sampling 
distribution looks like is by using theory. The main 
theory that helps us understand the characteristics 
of the sampling distribution is called the central 
limit theorem. 


The central limit theorem is an incredibly useful and 
powerful theorem. The theorem tells us about the 
distribution of many different sampling 
distributions. But be careful! The central limit
theorem cannot always be applied, and it applies only
to sampling distributions.

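Footnote: The standard error formula for sample means, $\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}}$, assumes the population is infinite or
very large. If this is not the case, then the formula is
$\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$.
As the population size (N) increases, $\sqrt{\frac{N-n}{N-1}}$
approaches 1 and no longer affects the standard
error.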

The images/figures that follow were generated 
from David Lane's sampling distribution applet that 
is part of the OnlineStatBook project 

Online Statistics Education: A Multimedia Course of 
Study (http://onlinestatbook.com/). Project Leader: 
David M. Lane, Rice University. 

[Figure: Parent population]
[Figure: Sampling distribution for Figure 1 for samples of size 2]
[Figure: Sampling distribution for Figure 1 for samples of size 5]
[Figure: Sampling distribution for Figure 1 for samples of size 10]
[Figure: Sampling distribution for Figure 1 for samples of size 16]
[Figure: Sampling distribution for Figure 1 for samples of size 20]
[Figure: Sampling distribution for Figure 1 for samples of size 25]


The central limit theorem for the 
sampling distribution for sample means 


The sampling distribution for the sample means 
comes from a parent population that is comprised of 
quantitative data. Random samples of size n are 
taken from the parent population and the sample 
mean is calculated for each sample. What will the 
distribution of the sample means look like? That is, 
what is the shape of the distribution of sample 
means, where are the sample means centred, and 
what is the sampling variability? 


The following refers to the theoretical sampling 
distribution for the sample means. Further, when 
sample size is mentioned, it is referring to the size 


of the sample taken from the population. That is, it 
is not referring to how many different random 
samples have been taken. 


Where are the sample means centred? 


As the sample means are estimating the population 
mean, it makes sense that the sample means are 
centred around the population mean. 


In the previous section, we saw the right skewed 
parent population in Figure 1. The population mean 
of that parent population is 8.08. Notice that the 
empirical sampling distributions shown in Figures 5 
and 6 are both centred around 8.08. 


In general, the mean of the theoretical sampling 
distribution for the sample means equals the 
population mean. 


$\mu_{\bar{x}} = \mu_x$


The variable for the sample means is $\bar{x}$. That is
why the subscript for the mean of the sample
means ($\mu_{\bar{x}}$) has changed.


What is the sampling variability? (or what is the 
variation in the sampling distribution) 


Based on the law of large numbers, the sampling 
variability of the sample means will decrease as the 
sample size increases. As the sample size increases, 
the sample means will become better and better 
estimates of the population mean and, therefore, 


there will be less variability between them. That is,
there will be more variability between the sample
means for samples of size 2 than there will be for
samples of size 30.


Just like we can measure variability for individual 
data values, we can also measure variability for 
sample means. We will use the standard deviation to 
measure the sampling variability. The standard 
deviation of the sampling distribution for sample 
means is called the standard error of the sample 
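means. It is found with the following formula
[footnote]:

$\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}}$

With the finite population correction described in the footnote, the formula becomes:

$\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$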
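
The standard error calculation, with and without the finite population correction mentioned in the footnote, can be sketched as follows. The function name and the example values (a population standard deviation of 12, population sizes of 200 and 1,000,000) are assumptions chosen only for illustration.

```python
import math

def standard_error_mean(sigma, n, population_size=None):
    """Standard error of the sample mean.

    Applies the finite population correction when population_size is given.
    """
    se = sigma / math.sqrt(n)
    if population_size is not None:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se

print(standard_error_mean(12, 2))    # samples of size 2
print(standard_error_mean(12, 30))   # samples of size 30: smaller standard error
print(standard_error_mean(12, 30, population_size=200))        # small population
print(standard_error_mean(12, 30, population_size=1_000_000))  # very large population
```

For the very large population the correction factor is so close to 1 that the result is practically the same as the uncorrected standard error.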


What is the shape of the distribution?

This is actually a really interesting question.
Suppose the parent population looks like this
[footnote]:

[Figure: Parent population]


What will the sampling distribution for sample 
means look like? 


Here’s the answer: 


* If the parent population is normal, then the
sampling distribution for sample means will be
normal. Always.

* As the size of the samples being taken
from the parent population increases, the
sampling distribution for sample
means becomes more normal.


Since the population in Figure 1 is not normally
distributed, we would expect that the sampling
distribution will not be normal for smaller sample
sizes, but will be approximately normal for larger sample sizes.


[Figure: Distribution of Means, N=2]

For each of these empirical sampling distributions,
100,000 samples were taken of size n. Therefore,
we can be very confident that the empirical
sampling distributions are good representations of
the theoretical sampling distributions.

[Figure: Distribution of Means, N=5]

[Figure: Distribution of Means, N=10]

[Figure: Distribution of Means, N=16]

[Figure: Distribution of Means, N=20]

[Figure: Distribution of Means, N=25]


Figure 1 (the parent population) is not even close to 
being normal, but notice that as the sample size 


increases, the sampling distribution for sample 
means gets closer and closer to being normally 
distributed! 


In general, the closer the population is to being 
normally distributed, the “faster” the sampling 
distribution gets closer to normal. Here “faster” 
means for a smaller sample size. 


The central limit theorem states that regardless of
the shape of the population, if the sample size is
greater than 30, the sampling distribution for sample means will be
approximately normal.
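
One way to see this numerically is to measure the skewness of the sample means as the sample size grows; a value near 0 indicates a symmetric, more normal-looking distribution. The sketch below uses an exponential parent population as a stand-in for a skewed population (an assumption, not the population in Figure 1), and the helper names are hypothetical.

```python
import random
import statistics

def skewness(data):
    """Sample skewness: the average cubed z-score (0 for a symmetric shape)."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return statistics.fmean([((x - mean) / sd) ** 3 for x in data])

def sample_means(sample_size, num_samples=20_000, seed=0):
    """Means of random samples from a right-skewed (exponential) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(num_samples)
    ]

skew_by_size = {n: skewness(sample_means(n)) for n in (2, 5, 10, 30)}
for n, skew in skew_by_size.items():
    print(f"n = {n:2d}: skewness of the sample means = {skew:.2f}")
```

The skewness shrinks toward 0 as n increases, matching the pattern seen in the figures.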


Measure              Population     Sample        Sampling distribution for the sample means
Mean                 $\mu_x$        $\bar{x}$     $\mu_{\bar{x}} = \mu_x$
Standard deviation   $\sigma_x$     $s_x$         $\sigma_{\bar{x}} = \sigma_x/\sqrt{n}$ (standard error)

Summary of measures

[Figure: Empirical sampling distributions for sample proportions]

The central limit theorem for the 
sampling distribution for sample 
proportions 


The sampling distribution for the sample 
proportions comes from a parent population that 
satisfies the criteria of the binomial distribution. 
Random samples of size n are taken from the parent 
population and the sample proportion is calculated 
for each sample. What will the distribution of the
sample proportions look like? That is, what is the shape
of the distribution of sample proportions, where are 
the sample proportions centred, and what is the 
sampling variability? 


The sampling distribution for sample proportions 


has similar characteristics as the sampling 
distribution for the sample means. 


Where are the sample proportions centred? 


They are centred around the population proportion. 


What is the sampling variability? 


It decreases as the sample size increases. 


What is the shape? 


The shape of sampling distributions of the sample
proportions also becomes normal. Unlike for sample
means though, the normality is not based on sample
size, but is based on the number of successes ($n\pi$)
and failures ($n(1-\pi)$).


To illustrate, here are the empirical sampling 
distributions for proportions for various population 
proportions. The sample size is 100 in each case and 
the number of samples taken is 10,000. 
In Figure 8 a, n = 100 and $\pi$ = 0.01. Therefore, the
number of successes is 1 and the number of failures
is 99. The sampling distribution is skewed to the
right.


In Figure 8 b, n = 100 and $\pi$ = 0.20. Therefore, the
number of successes is 20 and the number of
failures is 80. The sampling distribution is
approximately normal.


In Figure 8 c, n = 100 and $\pi$ = 0.60. Therefore, the
number of successes is 60 and the number of
failures is 40. The sampling distribution is
approximately normal.


In Figure 8 d, n = 100 and $\pi$ = 0.96. Therefore, the
number of successes is 96 and the number of
failures is 4. The sampling distribution is skewed to
the left.


In general, the shape of the sampling distribution
for sample proportions is approximately normal if
the number of successes and the number of failures
are both at least 5.
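
The rule can be written as a small check. The function name is hypothetical; the threshold of 5 and the four cases (n = 100 with proportions 0.01, 0.20, 0.60 and 0.96) come from the text above.

```python
def is_approximately_normal(n, pi, minimum=5):
    """Check that expected successes and failures are both at least `minimum`."""
    successes = n * pi
    failures = n * (1 - pi)
    return successes >= minimum and failures >= minimum

for pi in (0.01, 0.20, 0.60, 0.96):
    print(f"n = 100, proportion = {pi}: approximately normal? "
          f"{is_approximately_normal(100, pi)}")
```

The two middle cases pass the check and the two extreme cases fail it, in line with the shapes described for Figure 8.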


If the sampling distribution for sample proportions 
is normal, we can find probabilities for the 
distribution using two methods. The first method is 
using the binomial distribution. The second method 
is the normal distribution. This might seem a bit 
strange as the binomial distribution is for discrete 
random variables and the normal distribution is for 
continuous random variables. In reality, we use the 


normal distribution to approximate probabilities for 
the sampling distribution for sample proportions. 
This is called the normal approximation to the 
binomial distribution. To get the exact probability, 
one would need to use the binomial distribution. 
But this can be cumbersome when the sample sizes 
are very large (e.g. 1000). Therefore, using the 
normal distribution can be beneficial, especially 
because it gives very accurate approximations. In 
example 6.4 below we will investigate this further. 


Further, when we begin to do inferential statistics,
we won’t know the population proportion
(otherwise inferential statistics wouldn’t be
necessary). Since we won’t know $\pi$, it will be hard to
use the binomial distribution. Therefore, we use the
normal approximation to the binomial distribution
instead.


If we use a normal approximation to the binomial 
distribution, we need to know the mean and 
standard deviation of the sampling distribution. 


The mean of the sampling distribution for sample 
proportions is the population proportion. 


$\mu_{\hat{p}} = \pi$


The standard deviation of the sampling distribution 
for sample proportions (or the standard error of 
sample proportions) is found using the following 
formula: 


$\sigma_{\hat{p}} = \sqrt{\frac{\pi(1-\pi)}{n}}$


A series of examples - Calculating probabilities for 
sampling distributions - MRU - C Lemieux 

Four examples that show how to calculate and 
interpret probabilities found for sampling 
distributions 


If a sampling distribution is normally distributed, 
then we can find probabilities for the sampling 
distribution using the normal distribution just like 
we did in Chapter 5. 


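The z-score for the sampling distribution for sample
means would be:
$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu_x}{\sigma_x / \sqrt{n}}$


The z-score for the sampling distribution for sample
proportions would be:
$z = \frac{\hat{p} - \mu_{\hat{p}}}{\sigma_{\hat{p}}} = \frac{\hat{p} - \pi}{\sqrt{\pi(1-\pi)/n}}$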
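
Both z-scores can be computed with a few lines of Python, and the standard library's NormalDist class can play the role of the "computer program" that turns a z-score into a probability. The function names and the numbers below are hypothetical and were chosen only to illustrate the calls.

```python
import math
from statistics import NormalDist

def z_for_sample_mean(x_bar, mu, sigma, n):
    """z-score of a sample mean, using the standard error sigma / sqrt(n)."""
    return (x_bar - mu) / (sigma / math.sqrt(n))

def z_for_sample_proportion(p_hat, pi, n):
    """z-score of a sample proportion, using sqrt(pi * (1 - pi) / n)."""
    return (p_hat - pi) / math.sqrt(pi * (1 - pi) / n)

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Hypothetical numbers, used only to show how the functions are called.
z_mean = z_for_sample_mean(x_bar=102, mu=100, sigma=15, n=36)
z_prop = z_for_sample_proportion(p_hat=0.55, pi=0.50, n=400)

print("z for the sample mean:", z_mean)
print("P(sample mean below 102):", standard_normal.cdf(z_mean))
print("z for the sample proportion:", z_prop)
print("P(sample proportion above 0.55):", 1 - standard_normal.cdf(z_prop))
```

These probabilities are only meaningful when the sampling distribution can reasonably be treated as normal.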


The Old Baldy Tire Company is introducing a 
new steel belted radial tire. Old Baldy's 
engineering department has estimated that the 
mean life of this tire will be 50,000 km and that 
the standard deviation of these tires will be 
10,000 km. Suppose a large number of random 
samples of 100 tires is taken. The shape of the 
population distribution is unknown. 


1. Can we assume the distribution of the
mean life of these tires will be normal?
Explain.

2. Regardless of your result in part 1, assume that
we are dealing with a normal distribution.
Find the probability that the mean life of a
random sample of 100 tires is less than
49,000 km.

3. A competitor of Old Baldy's takes a
random sample of 100 tires and finds their
mean life to be 49,000 km. Based on
this data, they claim that the engineering
department of Old Baldy's has exaggerated
the mean life of their new tires. Do you
support the competitor's claim? Explain.


1. Yes. As the sample size is greater than 30
(it is 100), we can assume that the
sampling distribution of the sample mean
lifespan of the tires is approximately normal,
regardless of the shape of the population
distribution, due to the central limit
theorem.

2. $\bar{x}$ = mean lifespan of 100 Old Baldy tires,
$\mu_{\bar{x}} = \mu_x = 50,000$, $\sigma_{\bar{x}} = \sigma_x/\sqrt{n} = 10,000/10 = 1000$.
Since we are assuming that the sampling
distribution is normal, we can
use a computer program to calculate the
probability $P(\bar{x} < 49,000)$. From the
computer program, we get $P(\bar{x} < 49,000) = 15.87\%$.

3. No. The probability that a random sample
of 100 Old Baldy tires has a mean lifespan
of less than 49,000 km is 15.87% (assuming Old
Baldy’s claim). This means that this event
is likely to occur (as it is greater than
10%), under the assumption that the tires
last on average 50,000 km, and does not
provide sufficient evidence against Old
Baldy’s claim.


The maintenance manager at a popular 
mountain resort is trying to determine if the 
aging gondola is in need of some renovation— 
or perhaps outright replacement. Right now, 
the maximum load of the gondola is 900 
kilograms or 12 persons. The manager knows 
that the average weight of North Americans has 
been on the rise for several years and wishes to 
test what the probabilities might be of this 
gondola exceeding the maximum capacity. 


Since the operators don’t currently look at 
gender—just numbers—the manager is 
concerned about what might happen if the 
worst-case scenario were to occur: 12 large 
adult males were allowed on the gondola at the 
same time. 


To investigate this further the manager did 
some research into the current average weight 
of adult males and discovered that it is about 


80 kilograms. He also knows that adult weight 
tends to be normally distributed by gender, 
with a standard deviation for males of about 12 
kilograms. 
1. Given this information, he first wants to
know what the individual weight
allowance is (i.e. the per person average)
that the gondola can withstand.

2. He also wants to know how likely it is that
the individual weight of any randomly
selected male will exceed the individual
weight allowance calculated above.

3. Finally, he wants to know how likely it
would be that the average weight of a
sample of 12 adult males would exceed the
average individual weight allowance.

4. Based on your answers, do you think the
manager should renovate the gondola? Is
there any further information that the
manager would need?


1. 900 kg / 12 people = 75 kg/person

2. Since this is about an individual, we will
use $\mu_x$ and $\sigma_x$. As stated in the question,
we know that the population is normally
distributed. From this, a computer program
calculated that $P(X > 75) = 66.15\%$, where
X is the weight of an individual person on
the gondola.

3. Now, we are being asked about the mean
for 12 people. Therefore, this question is
about finding a probability for the
sampling distribution for sample means.
Therefore, we will use $\bar{x}$ = mean weight
of 12 people, $\mu_{\bar{x}} = \mu_x = 80$, $\sigma_{\bar{x}} = \sigma_x/\sqrt{n} = 12/3.46 = 3.46$.
Since the population
distribution is normal, we know that the
sampling distribution will also be normal
(regardless of the sample size). Therefore,
we can use a computer program to
calculate the probability $P(\bar{x} > 75)$. We
get 92.55%.

4. The probability found in part 3 is the
probability that the average mass of 12
adult males will exceed the maximum
individual weight for the gondola. The
next question is “how likely is it that there
will be 12 adult males on the gondola?”
The manager should do further research to
determine this before making a decision.
While waiting for the results, the manager
should implement a policy where any large
groups of males are broken up and are
required to take the lift in separate groups,
i.e. break up a group of 12 males into two
groups of 6 males.
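
The two gondola probabilities can be reproduced with Python's standard library, which here stands in for the unnamed computer program. The variable names are illustrative; the mean of 80 kg, standard deviation of 12 kg, group size of 12 and load limit of 900 kg come from the example.

```python
import math
from statistics import NormalDist

mu, sigma, n = 80, 12, 12
allowance = 900 / n  # per-person allowance in kilograms

# One randomly selected adult male: use the population distribution.
p_individual = 1 - NormalDist(mu, sigma).cdf(allowance)

# Mean weight of 12 adult males: use the sampling distribution,
# whose standard deviation is the standard error sigma / sqrt(n).
standard_error = sigma / math.sqrt(n)
p_group_mean = 1 - NormalDist(mu, standard_error).cdf(allowance)

print(f"allowance per person: {allowance} kg")
print(f"P(individual weighs more than the allowance) = {p_individual:.4f}")
print(f"P(mean of 12 exceeds the allowance) = {p_group_mean:.4f}")
```

The second probability is larger because the mean of 12 weights varies much less than a single weight, so it is less likely to fall below 75 kg.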


The city of Montreal has an extensive bike lane 
system. In fact, it is one of the largest in North 


America. But many cyclists find that even with 
all of the bike lanes, it is still hard to get around 
the city on a bike. In particular, there are many 
lanes that run east/west, but few that run 
north/south. Thus, they are encouraging the 
city council to focus less on adding lots of 
kilometers to the system, but instead making 
sure that the current system properly connects 
all parts of the city. 


The city council will only go forward with this 
idea if at least 66% of the residents support 
focusing on connecting the system rather than 
expanding the system. 


Suppose that 62% of residents do support 
connecting the system rather than expanding it. 
What is the probability that a random sample of 
1000 residents will have a sample proportion of 
at least 66%? 


1. Find the above probability using the 
binomial distribution. 

2. Find the above probability by using the 
sampling distribution for sample 
proportions. 

3. Compare the two answers. Do they give 
similar answers? 

4. Based on your answers, do you think that 
it is possible that the city of Montreal will 
choose to focus on connecting the bike 


path system? 


1. Since we are using the binomial
distribution, we are being asked to find the
probability that at least 660 of the 1000
people in the poll will want to focus on
connecting the system. The 660 comes
from 66% of 1000. In other words, we are
asked to find $P(X \geq 660)$, with n = 1000
and $\pi$ = 62%. Using a computer
program, this yields a probability of
0.48%. This is found by highlighting all of
the values above 660 and including 660.

2. Since we are using the sampling
distribution for sample proportions, we are
asked to find the probability that the
sample proportion will be at least 66%. In
other words, we are asked to find $P(\hat{p} \geq 0.66)$.
We can assume the sampling
distribution for sample proportions is
normal as the number of successes ($n\pi = 1000 \times 0.62 = 620$)
and the number of
failures ($n(1-\pi) = 1000 \times 0.38 = 380$)
are both at least five. Therefore, we will
use the normal distribution to find the
probability with $\mu_{\hat{p}} = \pi = 0.62$ and
$\sigma_{\hat{p}} = \sqrt{\pi(1-\pi)/n} = \sqrt{0.62(1-0.62)/1000} = 0.01535$.
Therefore, using a computer
program we find $P(\hat{p} \geq 0.66) = 0.46\%$.

3. The two probabilities are quite close. They
are only about 0.02% apart. Therefore, the two
methods give us similar results.

4. If the proportion of residents that want
to focus on connecting the bike system is
62%, it is unlikely that a poll of 1000
people would result in a sample proportion
of at least 66%. Therefore, it is unlikely that the
city of Montreal will choose to focus on
connecting the system.
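
The comparison between the exact binomial probability and the normal approximation can be sketched in Python as follows. This is one possible way to carry out the calculation, not necessarily the program used for the answers above; the variable names are illustrative, and the normal approximation here uses no continuity correction.

```python
import math
from statistics import NormalDist

n, pi = 1000, 0.62
k = 660  # 66% of 1000

# Exact: P(X >= 660) for X ~ Binomial(1000, 0.62), summed term by term.
p_binomial = sum(
    math.comb(n, i) * pi**i * (1 - pi)**(n - i) for i in range(k, n + 1)
)

# Normal approximation: the sample proportion has mean pi and
# standard error sqrt(pi * (1 - pi) / n).
standard_error = math.sqrt(pi * (1 - pi) / n)
p_normal = 1 - NormalDist(pi, standard_error).cdf(k / n)

print(f"standard error = {standard_error:.5f}")
print(f"binomial probability = {p_binomial:.4%}")
print(f"normal approximation = {p_normal:.4%}")
```

Both methods give a probability of well under 1%, so they lead to the same conclusion about the poll.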


Video games are gaining more and more 
popularity. Children often try to convince their 
parents to buy games even when they are not 
appropriate. For example, they may want to 
play a very violent game that is not appropriate 
for their age group. To help parents out, video 
games have rating categories to suggest age 
appropriateness. But how aware are parents of 
these categories? 


To investigate this, you conduct a survey of 
Canadian families that have young children 
who play video games. You show parents three 
video game covers that have the category rating 
clearly marked on it. You then ask the parents 
whether the games would be appropriate for 
children and why. If the parent correctly 
identifies which games are appropriate for their 
children and refers to the ratings in making 
their choice, you categorize the parents as well 


informed. 


Suppose that we want to use your results to 
justify the claim that less than 30 percent of 
parents are well informed about video game 
ratings. In your random sample of 1000 
parents, you actually found that 27 percent of 
the parents that you polled were well informed 
about video game ratings. 


1. Assuming that the proportion of parents 
that are well informed about video game 
ratings is 30%, what is the probability that 
you would observe a sample proportion of
less than 27%? Use the normal
approximation of the binomial distribution 
to find your answer. 

2. Based on your results, do you believe that 
this is enough evidence to suggest that less 
than 30% of parents are well informed 
about video game ratings? Explain your 
answer. 


1. Since we are using the sampling
distribution for sample proportions, we are
asked to find the probability that the
sample proportion will be less than 27%. In
other words, we are asked to find
$P(\hat{p} < 0.27)$. We can assume the sampling
distribution for sample proportions is
normal as the number of successes
($n\pi = 1000 \times 0.30 = 300$) and the number of
failures ($n(1-\pi) = 1000 \times 0.70 = 700$)
are both at least five. Therefore, we will use the
normal distribution to find the probability
with $\mu_{\hat{p}} = \pi = 0.30$ and
$\sigma_{\hat{p}} = \sqrt{\pi(1-\pi)/n} = \sqrt{0.30(1-0.30)/1000} \approx 0.01449$.
Therefore, using a computer program we find
$P(\hat{p} < 0.27) \approx 1.92\%$.
2. Since the probability that we would
observe a sample proportion of less than 27%
(assuming a population proportion of 30%)
is about 1.92%, it is unlikely we would have
observed this evidence if the assumption is
true. Therefore, it is more likely that the
population proportion is less than 30%.
Thus there is enough evidence to suggest
that less than 30% of parents are well
informed about video game ratings.
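
The calculation in this solution can be sketched in Python. This is a minimal sketch, assuming the normal approximation described above and using only the standard library's `statistics.NormalDist`; the variable names are illustrative and are not part of the original exercise.

```python
from math import sqrt
from statistics import NormalDist

n = 1000        # sample size
pi_0 = 0.30     # assumed population proportion
p_hat = 0.27    # observed sample proportion

# Expected successes and failures under the assumption; both should be at least five.
successes = n * pi_0
failures = n * (1 - pi_0)

# Mean and standard deviation of the sampling distribution of the sample proportion.
mean_p_hat = pi_0
sd_p_hat = sqrt(pi_0 * (1 - pi_0) / n)

# Left-tail probability P(p_hat < 0.27) under the normal approximation.
prob = NormalDist(mean_p_hat, sd_p_hat).cdf(p_hat)

print(f"successes = {successes:.0f}, failures = {failures:.0f}")
print(f"sd of sample proportion = {sd_p_hat:.5f}")
print(f"P(p_hat < {p_hat}) = {prob:.4f}")
```

Running the sketch prints the success and failure counts, the standard deviation of the sample proportion, and the left-tail probability, which can be compared with the values in the solution.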


Practice questions 


The following practice questions are from Lyryx 
Learning, Business Statistics I -- MGMT 2262 -- Mt 
Royal University -- Version 2016 Revision A. 
OpenStax CNX. Sep 8, 2016
http://cnx.org/contents/f3aefa9e-58d2-41ea-969f-04dc2cb04c82@5.5


Use the following information to answer the next ten 
exercises: A manufacturer produces 25-pound lifting 
weights. The lowest actual weight is 24 pounds, and 
the highest is 26 pounds. Each weight is equally 
likely so the distribution of weights is uniform. A 
sample of 100 weights is taken. The standard 
deviation is 0.58 pounds. 

1. What is the distribution for the weight of
one 25-pound lifting weight? What is the
mean and standard deviation?

2. What is the distribution for the mean
weight of 100 25-pound lifting weights?

3. Find the probability that the mean actual
weight for the 100 weights is less than
24.9.


1. Uniform with a mean of 25 and a standard
deviation of 0.58 pounds. Remember when
a distribution is uniform all of the values
are equally likely. Therefore the mean will
be halfway between the lowest value (24)
and the highest value (26).

2. Normal with a mean of 25 and a standard
deviation of 0.0577

3. 0.0416


Find the probability that the mean actual 


weight for the 100 weights is greater than 25.2. 


0.0003 


Find the 90th percentile for the mean weight for 
the 100 weights. 


25.07 
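
The lifting-weight answers above can be reproduced with a short Python sketch. It assumes, as the answers do, that the mean of 100 weights is approximately normal with standard deviation equal to the single-weight standard deviation divided by the square root of the sample size; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

low, high = 24.0, 26.0   # lowest and highest actual weights (pounds)
n = 100                  # sample size

# Mean and standard deviation of a single weight (uniform distribution).
mu = (low + high) / 2
sigma = (high - low) / sqrt(12)

# Sampling distribution of the mean weight, treated as normal.
mean_dist = NormalDist(mu, sigma / sqrt(n))

p_below = mean_dist.cdf(24.9)        # P(mean weight < 24.9)
p_above = 1 - mean_dist.cdf(25.2)    # P(mean weight > 25.2)
pct_90 = mean_dist.inv_cdf(0.90)     # 90th percentile of the mean weight

print(f"sigma = {sigma:.2f}, sd of the mean = {sigma / sqrt(n):.4f}")
print(f"P(mean < 24.9) = {p_below:.4f}")
print(f"P(mean > 25.2) = {p_above:.4f}")
print(f"90th percentile = {pct_90:.2f}")
```

Here `cdf` gives a left-tail probability, so the right-tail probability is found by subtracting from 1, and `inv_cdf` converts a percentile into a value of the mean weight.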


Suppose that the distance of fly balls hit to the 
outfield (in baseball) is normally distributed 
with a mean of 250 feet and a standard 
deviation of 50 feet. We randomly sample 49 
fly balls. 


1. What is the probability that the 49 balls 
traveled an average of less than 240 feet? 

2. Find the 80th percentile of the distribution 
of the average of 49 fly balls. 


1. 0.0808 
2. 256.01 feet 


According to the Internal Revenue Service, the
average length of time for an individual to
complete (keep records for, learn, prepare,
copy, assemble, and send) IRS Form 1040 is 
10.53 hours (without any attached schedules). 
The distribution is unknown. Let us assume that 
the standard deviation is two hours. Suppose 
we randomly sample 36 taxpayers. 


1. Would you be surprised if the 36 taxpayers 
finished their Form 1040s in an average of 
more than 12 hours? Explain why or why 
not in complete sentences. 

2. Would you be surprised if one taxpayer 
finished his or her Form 1040 in more than 
12 hours? In a complete sentence, explain 
why. 


1. Yes. I would be surprised, because the 
probability is almost 0. 

2. No. I would not be totally surprised 
because the probability is 0.2312 
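
The contrast between the two answers comes from the standard deviation used. The following Python sketch illustrates it; note that the single-taxpayer calculation assumes individual completion times are roughly normal, which is an extra assumption since the exercise says the distribution is unknown.

```python
from math import sqrt
from statistics import NormalDist

mu = 10.53     # mean completion time (hours)
sigma = 2.0    # assumed standard deviation (hours)
n = 36         # number of taxpayers sampled

# Mean of 36 taxpayers: the standard deviation shrinks to sigma / sqrt(n).
p_mean = 1 - NormalDist(mu, sigma / sqrt(n)).cdf(12)

# One taxpayer: uses sigma itself (assumes individual times are roughly normal).
p_single = 1 - NormalDist(mu, sigma).cdf(12)

print(f"P(mean of {n} taxpayers > 12 hours) = {p_mean:.6f}")
print(f"P(one taxpayer > 12 hours) = {p_single:.4f}")
```

The same 12-hour cutoff is more than four standard deviations above the mean for the sample mean, but less than one standard deviation above the mean for a single taxpayer.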


Suppose that a category of world-class runners 
are known to run a marathon (26 miles) in an 
average of 145 minutes with a standard 
deviation of 14 minutes. Consider 49 of the 
races. Let $\bar{X}$ = the average of the 49 races.


1. Find the probability that the runner will
average between 142 and 146 minutes in
these 49 marathons.

2. Find the 80th percentile for the average of
these 49 marathons.

3. Find the median of the average running
times.


1. 0.6247
2. 146.68
3. 145 minutes


Determine which of the following are true and 
which are false. Then, in complete sentences, 
justify your answers. 

1. When the sample size is large, the mean of
$\bar{X}$ is approximately equal to the mean of
X.

2. When the sample size is large, $\bar{X}$ is
approximately normally distributed.

3. When the sample size is large, the standard
deviation of $\bar{X}$ is approximately the same
as the standard deviation of X.


1. True. The mean of a sampling distribution
of the means is approximately the mean of
the data distribution.

2. True. According to the Central Limit
Theorem, the larger the sample, the closer
the sampling distribution of the means
becomes to normal.

3. False. The standard deviation of the sampling
distribution of the means is the standard
deviation of X divided by the square root of
the sample size, so it decreases as the sample
size increases and is not approximately the
same as the standard deviation of X.
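
A small simulation can make the third statement concrete. This is only a sketch: it borrows the uniform distribution between 24 and 26 from the lifting-weight exercise as an arbitrary example, and the function name, the number of repetitions, and the seed are all illustrative choices.

```python
import random
from math import sqrt
from statistics import pstdev

random.seed(1)  # fixed seed so repeated runs give the same output

def sd_of_sample_means(n, reps=2000):
    """Estimate the standard deviation of the mean of n Uniform(24, 26) draws."""
    means = [
        sum(random.uniform(24, 26) for _ in range(n)) / n
        for _ in range(reps)
    ]
    return pstdev(means)

sigma = (26 - 24) / sqrt(12)  # standard deviation of a single draw

for n in (4, 25, 100):
    simulated = sd_of_sample_means(n)
    theoretical = sigma / sqrt(n)
    print(f"n = {n:3d}: simulated {simulated:.4f}, sigma/sqrt(n) {theoretical:.4f}")
```

As the sample size grows, the simulated standard deviation of the sample means shrinks toward zero rather than toward the standard deviation of a single observation.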


Introduction to Hypothesis Testing 
Introduction to chapter on hypothesis testing 


Inferential statistics 

In the first unit, we examined descriptive statistics 
where we described the data using visual and 
numerical summaries. If we made any conclusions, 
it was only about the specific data we were looking 
at. Now, we are going to look specifically at sample 
data and use it to make conclusions about the 
population as a whole. That is, instead of simply 
describing the data, we are going to make inferences 
about the population from a sample. We will do this 
in two ways: 1) Hypothesis tests and 2) Confidence 
intervals. 


Hypothesis tests 

We actually have already done hypothesis tests, but 
we did so informally. Back in Chapter 4 when we 
evaluated evidence, we actually were doing 
hypothesis tests (or "null hypothesis significance 
tests"). In particular, a hypothesis test begins with 
making an assumption (i.e. the null hypothesis) 
which is often the opposite of what we want to 
show. Then we collect evidence by collecting a 
random and representative sample (to the best of 
our abilities). Then we determine how well the 
evidence aligns with the assumption. If the evidence 
is very different from the assumption, we start to 
question the assumption. If the evidence leads us to
question the assumption, we reject the null
hypothesis. If the evidence does not make us
question the assumption, we do not reject the null
hypothesis.


This process of evaluating evidence should sound 
familiar as it is what we did in Chapter 4. As we are 
familiar with the process, we want to learn the 
following in this chapter: 


* The formal procedures of performing 
hypothesis tests. 

* How to choose and justify the appropriate 
model to use when performing a hypothesis 
test. 

* The correct terminology for hypothesis tests. 

* What Type I and Type II errors are and why they are
important.


Confidence intervals 

A hypothesis test only tells you whether you should 
reject or not reject the null hypothesis. For example, 
suppose you are testing that the mean number of 
hours that people sleep at night is less than 8 hours. 
Your final conclusion will indicate whether or not
there is enough evidence to support the claim that the
mean number of hours that people sleep at night is
less than 8 hours. But if your conclusion is that
there is enough evidence, the hypothesis test does
not tell you by how much it is less than 8 hours. Are
people getting, on average, 7.5 hours of sleep per
night or only 4 hours? The hypothesis test gives no
indication.


This is where a confidence interval comes in. A 
confidence interval provides an estimate of the 
population parameter based on sample data. So 
unlike a hypothesis test, the confidence interval can 
indicate around how many hours of sleep people 
are, on average, getting per night. This can be very 
useful. If the estimate is that people are getting 
somewhere around 7.5 hours of sleep on average, 
we know that, though it is different from 8 hours, it 
is not that different. But if the estimate says that 
people are getting only 4 hours, on average, then we 
should be worried. Thus, confidence intervals and 
hypothesis tests both give us useful information and 
both are needed to get a complete picture of the 
situation. 


For confidence intervals, we want to learn the 
following: 


* Understand what a confidence interval is.

* How to find and interpret a confidence interval.

* Define the factors that impact the width of the
confidence interval.

* Understand what a confidence level is.


Review 


This unit takes many of the things we learned from 
the previous units and brings them all together. 
Therefore, prior to starting this unit, you'll want to 
be familiar with the following key terms/ideas and 
their symbols (if relevant). 


* statistic vs. parameter

* sample mean vs. population mean (and their
related symbols)

* sample standard deviation vs. population
standard deviation (and their related symbols)

* sample proportion vs. population proportion
(and their related symbols)

* sampling distribution of sample means and the
central limit theorem

* binomial distribution


Overview of hypothesis testing - C Lemieux 
Goal of section


The goal of this section is to familiarize ourselves 
with the general process of hypothesis testing and to 
learn key terms. The specific type of hypothesis 
testing we are examining is called null hypothesis 
significance testing. In later sections we will look 
at specific hypothesis tests. 


General process of hypothesis testing 


Hypothesis tests are often used by researchers to 
evaluate their evidence against a certain 
assumption. The basic steps that they go through are 
as follows: 


1. Come up with a hypothesis: The fundamental
part of doing a hypothesis test is having a
hypothesis to investigate. The hypothesis we
are investigating is usually the alternative
hypothesis, while its opposite is called the null
hypothesis.

2. Gather evidence: Once you have a hypothesis,
you want to gather evidence to investigate it.
In statistics, the evidence is sample data.

3. Determine the level of significance: As
mentioned in the introduction, when doing
hypothesis testing, we can either reject or not
reject the null hypothesis. To determine which
decision to make, we need a threshold that
separates evidence that is strong enough to make us
question our assumption (the null hypothesis)
from evidence that is not. We call the threshold
"the level of significance".

4. Evaluate the evidence: To evaluate the 
evidence, we need to determine how unlikely it 
is we observed our evidence (or even better 
evidence against the null hypothesis), assuming 
the null hypothesis is true. We then use our 
level of significance (or threshold) to decide 
whether to reject or not reject the null 
hypothesis. To evaluate the evidence we need 
to use a model or distribution. We will be using 
the distributions we've already learned in class 
(i.e. the standard normal distribution and the 
binomial distribution) and some new ones. 

5. Make a conclusion: Once we've made our 
decision, we need to communicate our 
conclusion about the original hypothesis in a 
way that people unfamiliar with statistics may 
understand. 


Though the steps have been presented in a certain order,
they don't necessarily have to follow the order
provided. For example, choosing the level of
significance could be done in the first, second or
third step. But it would not be appropriate to do it
after evaluating the evidence as that could result in
the researcher biasing the results by choosing a level
of significance that allows them to make the
decision they want to make.


But some steps have to follow other steps. For 
example, you cannot evaluate the evidence without 
first knowing the hypotheses, gathering the 
evidence, and choosing the level of significance. 


For the remainder of this section, we will discuss the 
process of hypothesis testing in more detail. 


Null and alternative hypotheses 


The alternative hypothesis is sometimes called the
research hypothesis. We use the symbol $H_A$ or $H_1$ to
represent the alternative hypothesis. The alternative
hypothesis is often the hypothesis that we are
investigating.


The null hypothesis, on the other hand, is the
opposite of the alternative hypothesis. It is
represented with the symbol $H_0$, which is
pronounced "H nought".


For example, suppose you are investigating the
hypothesis that, on average, people sleep less than 8
hours a night. The alternative hypothesis (aka the
research hypothesis) would be "on average, people 
sleep less than 8 hours a night". The null hypothesis 
is the opposite of that. This means that instead of 
claiming that the average is less than 8, the null 
hypothesis states that "on average, people sleep at 
least 8 hours a night". 


The alternative hypothesis is the statement that
something has changed, is different, or is not the same,
while the null hypothesis is the statement that
nothing has changed, nothing is different, and it is
the same. Due to this, the null hypothesis always
needs to include an equality to indicate that
something is the same. Thus, the null hypothesis,
when written symbolically, usually includes one of
$\leq$, $=$, or $\geq$. The alternative hypothesis, on the other
hand, when written symbolically usually includes
one of $<$, $\neq$, or $>$.


So, for the example about sleep above, we can write
the alternative hypothesis symbolically as $H_A: \mu < 8$.
Similarly we can write the null hypothesis as
$H_0: \mu \geq 8$. Here, $\mu$ is the population mean amount of
sleep people get at night.


Notice that the null and alternative hypotheses are
about $\mu$, the population mean. When doing
hypothesis tests, we are always testing something
about a parameter (a feature of the population).
Therefore, the hypotheses are always written about
a population parameter (e.g. $\mu$ or $\pi$).


Tail of the test 


In a hypothesis test, we are evaluating the evidence 
by determining how unlikely it is we observed our 
evidence (or even better evidence against the null 
hypothesis), assuming the null hypothesis is true. 
The "even better evidence against the null 
hypothesis" is defined by the tail of the test. In 
particular, if we have a sample statistic, the tail 
would be the statistics that would provide better 
evidence against the null hypothesis. For example, if 
our sample mean number of hours of sleep was 7.5 
hours, then the tail would be any sample mean less 
than 7.5 hours because getting less than 7.5 hours 
sleep (e.g. 7.4, 7, 6.2 hours) is even better evidence 
against the assumption that people get 8 hours of 
sleep, on average, per night. Therefore, sample 
means to the LEFT of the sample mean found in our 
evidence would be in the tail. Hence, we would call 
this type of test a left-tailed test. The tail of the
test is usually found by examining the direction of 
the alternative hypothesis. 


* If $H_A$ has a "less than" symbol, then it is a
left-tailed test.

* If $H_A$ has a "not equal to" symbol, then it is a
two-tailed test.

* If $H_A$ has a "greater than" symbol, then it is a
right-tailed test.
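
The three rules above, together with the idea of "even better evidence against the null hypothesis", can be sketched in Python. The function names and the use of the strings "<", "!=" and ">" for the symbol in the alternative hypothesis are illustrative assumptions, not notation from this text.

```python
def tail_of_test(alt_symbol):
    """Map the symbol in the alternative hypothesis to the tail of the test."""
    tails = {"<": "left", "!=": "two", ">": "right"}
    return tails[alt_symbol]

def in_tail(candidate, observed, hypothesized, alt_symbol):
    """Return True if `candidate` is at least as strong evidence against
    the null hypothesis as the `observed` sample statistic."""
    tail = tail_of_test(alt_symbol)
    if tail == "left":
        return candidate <= observed
    if tail == "right":
        return candidate >= observed
    # Two-tailed: at least as far from the hypothesized value, in either direction.
    return abs(candidate - hypothesized) >= abs(observed - hypothesized)

# Sleep example: observed sample mean of 7.5 hours, hypothesized mean of 8 hours.
print(tail_of_test("<"))            # left
print(in_tail(7.4, 7.5, 8, "<"))    # True: 7.4 is further left than 7.5
print(in_tail(7.8, 7.5, 8, "<"))    # False: 7.8 is weaker evidence than 7.5
```

For the sleep example, a sample mean of 7.4 hours falls in the tail, while 7.8 hours does not.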


Assumption to start hypothesis test 

Since the alternative hypothesis is usually the 
hypothesis we are trying to investigate, it would be 
inappropriate to assume the alternative hypothesis 
is true. If we did this, we would be introducing 
confirmation bias. Instead, we want to assume the
opposite of what we want to show to be true and,
under this assumption, determine if the evidence is 
strong enough to make us question our assumption. 
Thus, a fundamental idea is that we start the 
hypothesis test under the assumption the null 
hypothesis is true. This is why it is called a null 
hypothesis significance test. 


Analogy: Murder trial 

Hypothesis testing is analogous to criminal trials. 
When a prosecutor puts someone on trial for 
murder, they do so because they believe the person 
is guilty (the alternative hypothesis). But in our 
criminal justice system, we assume the person is 
innocent (the null hypothesis) until enough 
evidence is presented to make us question that 
assumption. Or, in other words, until the jurors 
believe the person is guilty beyond a reasonable 
doubt. If jurors started a trial with the assumption of 
guilt, all of the evidence would be coloured by that 
assumption. Meaning, they would be predisposed to 
see any piece of evidence as evidence of guilt. 
Instead, we want jurors to assume that the
defendant is innocent and be persuaded by the
evidence. The same idea applies in hypothesis
testing: a researcher believes the claim they are
making is the right one (i.e. they believe the
alternative hypothesis is right), but they start off by
assuming the opposite of what they want to show
(i.e. the null hypothesis is true) to avoid bias.


Hypothesis testing and science are intimately
related. One of the defining features of science is the
principle of falsification, which essentially states
that a hypothesis has the potential to be falsified
(not that it has to be, only that it can be). Without
this principle, we would only look for evidence that
supports our beliefs. In hypothesis testing, we start
from the assumption of the opposite of what we
want to show to avoid this tendency to only look for
evidence that supports our beliefs. If you are
interested, watch this video by Crash Course that
explains the difference between science and
pseudoscience:
https://www.youtube.com/watch?v=-X8xXflOJdTQ.
In particular, pay attention to the
part where the narrator explains the difference
between the types of experiments Freud was doing
(which involved confirmation bias) and the ones
Einstein was doing.


We want to test whether the mean GPA of students
in Canadian universities is different from 2.0 (out
of 4.0). The null and alternative hypotheses are:
$H_0: \mu = 2.0$
$H_A: \mu \neq 2.0$


We want to test whether the mean height of
eighth graders is 165 centimeters. State the
null and alternative hypotheses. Fill in the
correct symbol ($=$, $\neq$, $\geq$, $<$, $\leq$, $>$) for the
null and alternative hypotheses.


1. $H_0: \mu$ _ 165 cm
2. $H_A: \mu$ _ 165 cm


1. $H_0: \mu = 165$ cm
2. $H_A: \mu \neq 165$ cm


Try It 


On a state driver’s test, about 40% pass the
test on the first try. We want to test if more
than 40% pass on the first try. Fill in the
correct symbol ($=$, $\neq$, $\geq$, $<$, $\leq$, $>$) for the
null and alternative hypotheses.


1. $H_0: \pi$ _ 0.40
2. $H_A: \pi$ _ 0.40


Recall that the Greek letter pi ($\pi$) represents
a population proportion.


1. $H_0: \pi = 0.40$
2. $H_A: \pi > 0.40$


Gather evidence 


The evidence in a hypothesis test is sample data. To 
do a hypothesis test, we need to randomly collect 
our data and we want our sample to be 
representative of the population. Further the results 
of our test have more veracity if the sample size is 
larger. We have discussed in Unit 1 how to collect 
data. 


Once we have our data, we will generate 
appropriate descriptive statistics both to help others 
know what our data says and also to determine if 
our original hypothesis is even viable. For example,
suppose we gather data to investigate if, on average,
people sleep for less than 8 hours in a day. Then, 
when we summarize that data, we find that the 
mean sleep time of our sample is 10 hours. That 
certainly suggests that our original hypothesis is not 
correct and we may want to revise our ideas. 


When doing a hypothesis test, we are always 
gathering sample data. If we could gather data 
about the whole population we wouldn't bother 
doing a hypothesis test. Remember that a hypothesis 
test is a type of inferential statistics where we use 
sample data to make a conclusion about the 
population as a whole. Thus, if we had the 
population data, a hypothesis test would not be 
necessary. 


Summary 
The alternative and null hypotheses are always
about the population parameter. When we gather
evidence it is always sample data.


Analogy: Murder trial 

Going back to the analogy about a murder trial, the 
evidence might be a strong motive, a smoking gun, 
no alibi, a witness, physical evidence etc. But all of 
this evidence has to be relevant to the trial (i.e. a
smoking gun from a completely different murder
trial would not be evidence for this murder trial). In
hypothesis testing, the evidence is always the 
sample data from the relevant population. 


Determine the level of significance 


To reject $H_0$ or to not reject $H_0$, that is the
question.


The level of significance is the threshold between
when we reject $H_0$ and when we do not reject $H_0$.
The symbol we use for the level of significance is
the Greek letter alpha: $\alpha$.


Analogy: Murder trial 

In a murder trial, the threshold between rejecting 
innocence and, thus, finding the defendant guilty is 
"beyond a reasonable doubt". But "beyond a 
reasonable doubt" is a nebulous idea. Two different 
juries could be presented with the exact same 
evidence and one could decide to find the defendant 
guilty while the other might find them not guilty. 
That is, the threshold for rejecting the
assumption of innocence is ill-defined. In hypothesis
testing, we want the threshold to be more clearly 
defined so that if two researchers were presented 
with the same sample and had the same hypothesis, 
they would come to the same conclusion. 


To review, we begin our hypothesis test by 
assuming the null hypothesis is true. We then collect 
sample data in an effort to determine how unlikely 
it is that we observed our sample data (or even 
better evidence against the null hypothesis), 
assuming the null hypothesis is true. If it is unlikely
that we observed the evidence, then we will
question the assumption (i.e. the null hypothesis)
and this will lead us to reject the null hypothesis.
The threshold (aka the level of significance) is the
point at which we would deem the evidence to be
unlikely (under the assumption). The idea of unlikely is a
probabilistic idea and thus we define our level of
significance by a probability. In particular, an event
is unlikely if the probability of it occurring is
small. Therefore, we want the threshold to be a
small percentage (or probability). Due to this, the 
level of significance is usually chosen to be a value 
between 1% and 10%, but most studies choose 1% 
or 5% as their level of significance. We'll discuss 
why later when we talk about errors in hypothesis 
testing. 


Evaluate evidence 


To summarize what has happened so far, we've 
determined our alternative hypothesis, which 
defined our null hypothesis (as the two hypotheses 
are opposites of each other). We have gathered our 
evidence (i.e. sample data) and chosen the level of
significance that will cause us to reject $H_0$. Now we
need to bring it all together by evaluating the
evidence.


There are four steps to evaluating the evidence: 


1. Choose an appropriate model (or distribution).

2. Find the test statistic.

3. Calculate the p-value.

4. Compare the p-value to the level of significance
to decide whether to reject the null hypothesis
or not.


Choose an appropriate model (or distribution) 


First things first: we evaluate the evidence
by determining the probability that we observed
our sample data (or even better evidence against the
null hypothesis) assuming the null hypothesis is
true. We will then determine whether this
probability suggests that the sample data is unlikely 
to occur under the assumption (that the null 
hypothesis is true). To find a probability, we want to 
use a distribution (e.g. standard normal distribution, 
binomial distribution), which we also call a model. 
Thus, the first step of evaluating evidence is 
determining what probability model or distribution 
we will use to find the probability. To do this, we 
have to consider what kind of data we are
examining (e.g. categorical or quantitative) and
whether the data satisfies the conditions of the
distribution.


For example, suppose we are examining whether 
people, on average, sleep less than 8 hours a day. To 
investigate this situation, we would collect data on 
how many hours a person sleeps. Therefore, our 
data would be continuous and quantitative. Further, 
we want to notice that we are comparing a sample
mean to a hypothesized population mean (i.e. 8 hours
of sleep). So we are not comparing individual sleep
hours, but instead comparing means. Therefore, we
want to use the sampling distribution of sample
means. For continuous data, the only
distribution we know is the normal distribution.
Therefore, we would need to determine whether it 
would be appropriate to assume the sampling 
distribution of sample means is normal (Hint: You'll 
probably have to refer to the central limit theorem 
in some way). 


Find the test statistic 


The test statistic is the statistic we need in order
to find the probability. For example, suppose we
know that the sampling distribution of sample
means is normally distributed. To evaluate our
sample data, we would need to calculate the z-score
using the formula: (sample mean - hypothesized
mean)/(standard deviation of the sample means).
The value that we get from this calculation is the
test statistic. The test statistic is an example of a
random variable.


Usually the test statistic is found by comparing the 
sample statistic with the hypothesized population 
parameter while taking into account variation. The 
comparison of the statistic and parameter is usually 
done by subtraction, while the consideration of 
variation is usually done through division. 
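
As an illustration of "subtract, then divide", here is a minimal Python sketch of the z-score test statistic for a sample mean. The function name is illustrative, and the standard deviation of 1.2 hours and sample size of 36 in the usage lines are hypothetical values chosen only so the arithmetic is easy to follow; they do not come from the text.

```python
from math import sqrt

def z_statistic(sample_mean, hypothesized_mean, sigma, n):
    """Compare the statistic with the hypothesized parameter (subtraction),
    then account for variation (division by the standard deviation of the
    sample means, sigma / sqrt(n))."""
    standard_error = sigma / sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

# Hypothetical sleep study: sample mean 7.5 hours, hypothesized mean 8 hours,
# population standard deviation 1.2 hours (assumed), 36 people sampled (assumed).
z = z_statistic(7.5, 8, 1.2, 36)
print(f"z = {z:.2f}")
```

With these assumed inputs the sample mean sits about 2.5 standard deviations of the sample means below the hypothesized mean.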


Find the p-value 


We are finally getting there! The p-value is defined 
as the probability we will observe our sample 
statistic (or even better evidence against the null 
hypothesis), assuming the null hypothesis is 
true. Here are a few points about the p-value: 


* The smaller the p-value, the more unlikely it
is that we observed the evidence under the
assumption that the null hypothesis is true.

* The p-value is a conditional probability, with
the condition always being the assumption that the
null hypothesis is true.

* Finding the p-value always includes an
inequality due to the "even better evidence"
portion. For example, if our sample mean
number of hours of sleep was 7.5 hours, then
we would find the probability that people get
7.5 hours or less (assuming the population
mean is 8 hours of sleep) because getting less
than 7.5 hours sleep (e.g. 7.4, 7, 6.2 hours) is
even better evidence against the assumption
that people get 8 hours of sleep, on average,
per night.


The p-value is found by using the test statistic and 
the distribution (or model). For example, suppose 
that the sampling distribution of sample means for 
the mean number of hours of sleep is normally 
distributed. Then the test statistic is the z-score.
Suppose the test statistic for the mean number of
hours of sleep per night is -2.5. Then we would
determine the p-value by finding $P(Z < -2.5)$ using
the standard normal distribution, which is
0.0062 = 0.62%.
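
The "computer program" step can be sketched with Python's standard library. This is one possible sketch, assuming the test statistic is a z-score and the model is the standard normal distribution; the function name and the "left"/"right"/"two" labels are illustrative.

```python
from statistics import NormalDist

def p_value_from_z(z, tail):
    """p-value for a z test statistic under the standard normal distribution."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "left":
        return std_normal.cdf(z)
    if tail == "right":
        return 1 - std_normal.cdf(z)
    if tail == "two":
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right' or 'two'")

# Sleep example: left-tailed test with a test statistic of -2.5.
print(round(p_value_from_z(-2.5, "left"), 4))  # 0.0062
```

For a two-tailed test the same test statistic would give a p-value twice as large, because evidence in both directions counts against the null hypothesis.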


Below, we will look at some of the common 
misconceptions about the p-value. 


Compare the p-value to the level of significance 


As we have already stated, the smaller the p-value,
the more unlikely it is that we observed the
evidence under the assumption that the null
hypothesis is true. The threshold for "small enough",
that is, the point at which we
determine that the sample data is unlikely to occur
under the assumption, is determined by the level of
significance, $\alpha$. Therefore, we make our decision as
follows:


* If $p < \alpha$, we reject $H_0$.
* If $p \geq \alpha$, we do not reject $H_0$.


What's amazing is that this general rule applies to 
all hypothesis tests that use the p-value to make a 
decision. 
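
Because the rule is the same for every test that uses a p-value, it fits in a few lines of Python. The function name and the returned strings are illustrative choices.

```python
def decide(p_value, alpha):
    """Apply the decision rule: reject H0 only when the p-value is below alpha."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be a probability strictly between 0 and 1")
    return "reject H0" if p_value < alpha else "do not reject H0"

print(decide(0.0062, 0.05))  # reject H0
print(decide(0.23, 0.10))    # do not reject H0
```

Note that a p-value exactly equal to the level of significance falls under "do not reject", matching the "greater than or equal to" case of the rule.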


Below, we'll discuss more about what "do not reject
$H_0$" means and what it does not mean.


Make a conclusion 


As the last step, we write a sentence that 
summarizes the results by making a statement about 
the alternative hypothesis. For example, for the 
number of hours of sleep research, we would either 
conclude that there is enough evidence or not 
enough evidence to suggest that people sleep, on
average, for less than 8 hours per night.


Conclusions in hypothesis tests are never 100%
true or false as we are making an inference about a
population from a sample. Therefore, it is always
possible that an error has been made. Hence, it is
NEVER appropriate to say that the alternative
hypothesis is 100% true or to say that the
hypothesis test proved something. Instead, we can
only say whether the evidence supports the
alternative hypothesis or not. We will discuss
errors in hypothesis testing below.


Common misconceptions about the p-value


The p-value is the most commonly used way that 
researchers evaluate evidence, but it is also poorly 
understood and many people (including people who 
use it) don't know what it means. This led to the 
American Statistical Association releasing a 
statement on p-values in an attempt to address the 
common errors. You can find the article here:
https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108.
Here are some common misconceptions:


Misconception: The p-value is the probability the 
null hypothesis is true. 

Think back to the definition of the p-value. It is a 
probability that determines how (un)likely it is that 
we observed our sample statistic (or even better 
evidence against the null hypothesis) assuming the 
null hypothesis is true. Thus, it is a conditional 
probability that evaluates the sample data (as 
summarized by a statistic) in comparison to an 
assumption. It does NOT measure the likelihood of 
the null hypothesis being true. 


Misconception: The p-value measures the
probability that the sample data occurred by random chance.
Again, this is untrue if we consider the definition of
the p-value. The misconception leaves out that the
p-value is a conditional probability, and it leaves out
the "or even better evidence" portion of the p-value.


Misconception: The smaller the p-value the more 
significant the result 

This is actually one of the limitations of the p-value. 
The p-value only determines if there is a statistically 
significant difference between the sample statistic 
and the assumed population parameter. It does not 
determine how big the difference is. Thus, the
p-value only tells you that, for example, the mean
number of hours of sleep that people get per night is 
less than 8. But it does not indicate how much less 
than 8 it is. Even if your p-value is very small, it 
does not make this indication. Like we saw with 
descriptive statistics, no one measure will tell you 
everything you need to know about the situation. 
Therefore, to fully investigate a hypothesis you need 
to consider more than just a p-value. We'll learn 
about confidence intervals in the next chapter, 
which are an additional way to investigate a 
hypothesis. Effect sizes are another way to investigate 
the significance of the results, but they are beyond the 
scope of this course. 
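
To see why a small p-value does not imply a large difference, the following minimal Python sketch compares two hypothetical left-tailed tests of the sleep example (null value of 8 hours). The sample means, the sample sizes, the assumed known population standard deviation of 1 hour, and the helper name left_tailed_p_value are all illustrative assumptions, not data from this book. The z statistic used is the standard one for a mean when the population standard deviation is known.

```python
from math import sqrt
from statistics import NormalDist


def left_tailed_p_value(sample_mean, mu0, sigma, n):
    """p-value for HA: mu < mu0, assuming sigma is known (hypothetical helper)."""
    # Standardize the sample mean using the standard error sigma / sqrt(n).
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # For a left-tailed test, the p-value is the area to the left of z.
    return NormalDist().cdf(z)


# Assumed scenario 1: huge sample, tiny difference (7.98 vs. 8 hours).
p_small_gap = left_tailed_p_value(sample_mean=7.98, mu0=8, sigma=1.0, n=100_000)

# Assumed scenario 2: small sample, larger difference (7.5 vs. 8 hours).
p_large_gap = left_tailed_p_value(sample_mean=7.5, mu0=8, sigma=1.0, n=25)

print(f"tiny difference, n = 100000: p-value = {p_small_gap:.2e}")
print(f"larger difference, n = 25:   p-value = {p_large_gap:.4f}")
```

In this sketch the test with the much smaller difference has the much smaller p-value, simply because its sample is so large. The p-value alone therefore cannot tell you how far below 8 the mean is.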


What does "do not reject the null 
hypothesis" mean? 


Suppose our p-value is 23%. As our level of 
significance can only be between 1% and 10%, 
regardless of our choice of level of significance, the 
p-value is greater than the level of significance. This 
means there is not enough evidence to make us 
question our assumption. Therefore, we do not 
reject the null hypothesis. But what does this mean? 
It actually means very little. When we say "do not 
reject HO ", all we are saying is there is not enough 
evidence to support the alternative hypothesis. It 
says nothing about the truth of the null hypothesis! 
Therefore, saying "do not reject HO " does not mean 
"accept HO " as it only tells us that we don't have 
enough evidence to support HA (which is really not 
telling us much) and it is telling nothing about HO. 
Therefore, this decision is telling us very little. 


Why does "do not reject the null hypothesis" NOT 
mean "accept the null hypothesis"? 


There are various ways to explain this. Here are 
three of them. 


Before we look at them, it is important to remember 
that HO and HA are opposites of each other. 
Therefore, if HO is shown to be false, HA would then 
be true, and vice versa. 


Analogy: Murder trial 

In a murder trial, the jury start with the assumption 
of innocence. If they reject this assumption, they 
conclude the defendant is guilty. But if they do not 
reject this assumption, they conclude the defendant 
is not guilty. Notice, they do not conclude the 
defendant is innocent. This is because the trial is 
only examining whether they are guilty or not. It is 
not examining their innocence. A "not guilty" verdict 
does not mean the defendant is innocent. It only 
means there is not enough evidence to show the 
defendant is guilty. These are two very different 
things. In a similar manner in hypothesis testing, the 
decision to "not reject HO " only means there is not 
enough evidence to suggest the alternative 
hypothesis is correct. It says nothing about the null 
hypothesis. 


Informal fallacy 

Suppose we do not reject the null hypothesis and we 
used that to conclude that there is evidence for the 
null hypothesis. Then we would have committed the 
informal fallacy (or made a bad argument) called 
arguing from ignorance. The informal fallacy of 
arguing from ignorance is when we arrive at a 
conclusion because there is a lack of evidence to the 
contrary. We say X is true because there is not 
enough evidence to say it is false OR we say X is 
false because there is not enough evidence to say it 
is true. For example, the argument "there is no 
evidence that aliens exist means that aliens do not 
exist" is an argument from ignorance as we are 
saying the absence of evidence of aliens conclusively 
means there are no aliens. It does not allow that we 
simply haven't found the evidence yet. 


How does this relate to "do not reject HO "? If we 
used this conclusion to determine that there is 
evidence for HO (i.e. we say accept HO ) , then we 
would be making an argument from ignorance. In 
particular, the argument would be "since there is not 
enough evidence for HA , then HA must be false." 
Since HA and HO are opposites, this would mean 
that we would conclude accept HO . But this would 
be an argument from ignorance as we are saying an 
absence of evidence for HA is evidence for HO . 
Thus, we are not allowing that we simply haven't 
found the evidence yet (or that we might not ever 
find it). 


In summary, "do not reject HO" does not mean 
"evidence for HO", as concluding this would be an 
argument from ignorance (that is, saying that an absence 
of evidence for X means that X is false and not-X is 
true). 


Formal logic 

Two things: 1) If you do not know formal logic, 
ignore this section; 2) Hypothesis testing does not 
follow formal logic as there is no certainty in 
hypothesis testing but there is certainty in formal 
logic. Therefore, this section should be read with 
the warning that nothing is 100% 
true in hypothesis testing. This is why you'll see 
"true" and "false" in the explanation below: the 
quotation marks indicate that we aren't discussing Truth 
but rather whether the evidence suggests something 
is true. 


In formal logic, our null hypothesis can be written 
as a conditional statement: If P then Q. In particular, 
P is the statement the null hypothesis is true (or 
equivalently, our parameter equals some number X) 
and Q is the statement the sample data would 
support the null hypothesis (i.e. the relevant statistic 
would be close to the number X). That is, it is the 
statement: "If the null hypothesis is true, then we 
expect the sample data would support the null 
hypothesis". When performing a hypothesis test, this 
whole statement is considered to be true. 


The truth table for the conditional statement is as 
follows: 


Row   P       Q       If P then Q 
1     True    True    True 
2     True    False   False 
3     False   True    True 
4     False   False   True 


As the whole conditional statement is true, the 
situation where P is true and Q is false is not 
possible in this scenario (i.e. the second row does 
not apply to hypothesis testing). Therefore, only 
rows 1, 3 and 4 are possible. 


Situation: Reject HO 

Suppose we gather our evidence and our statistic is 
nowhere near X (i.e. our p-value is small), then the 
statement "the sample data supports the null 
hypothesis" would be "false". Looking at rows 1, 3 
and 4 on the truth table, then the only situation 
where Q is false is in the fourth row. This means, 
that P would also have to be false. In other words, if 
we reject HO , then the evidence suggests that HO is 
false and we can conclude (since HO and HA are 
opposites of each other) that the evidence supports 
HA. 


Situation: Do not reject HO 

Suppose we gather our evidence and our statistic is 
near X (i.e. our p-value is large), then the statement 
"the sample data supports the null hypothesis" 
would be "true". Looking at rows 1, 3 and 4 on the 
truth table, then when Q is true, P could be either 
true (first row) or false (third row). This means, we 
do not know if P is true or false. In other words, if 
we do not reject HO , then we cannot make any 
conclusion about HO as it could be either true or 
false and we don't know which one! 


In short, if we reject HO , we can conclude that HO 
is likely false and therefore, HA is likely true (i.e. 
there is evidence to support HA ). But if we do not 
reject HO , then HO could be either true or false. We 
simply do not know. Therefore, we cannot make any 
real conclusion. The best we can say is that there is 
not enough evidence to support HA , which is not 
really saying anything. 
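
The row-by-row argument above can be checked mechanically. The short Python sketch below enumerates the truth table of "If P then Q", keeps only the rows where the conditional is true, and then lists which values of P remain possible once Q is known. The names implies, possible_rows, p_when_q_false and p_when_q_true are illustrative choices, and the same warning applies: this is formal logic, so it only mirrors the reasoning and ignores the uncertainty present in real hypothesis tests.

```python
from itertools import product


def implies(p, q):
    """Truth value of the conditional statement 'If P then Q'."""
    return (not p) or q


# Rows come out in the usual order: (T, T), (T, F), (F, T), (F, F).
rows = [(p, q, implies(p, q)) for p, q in product([True, False], repeat=2)]

# In hypothesis testing the whole conditional is taken to be true,
# so only the rows where 'If P then Q' is true remain (rows 1, 3 and 4).
possible_rows = [(p, q) for p, q, conditional in rows if conditional]

# Reject HO: the data do not support the null, so Q is "false".
p_when_q_false = {p for p, q in possible_rows if not q}

# Do not reject HO: the data are consistent with the null, so Q is "true".
p_when_q_true = {p for p, q in possible_rows if q}

print("Q false -> possible values of P:", p_when_q_false)
print("Q true  -> possible values of P:", p_when_q_true)
```

When Q is false, the only remaining value of P is False, which matches "reject HO". When Q is true, P can be either True or False, which is why "do not reject HO" leaves HO undetermined.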


Summary 

Above, three different ways of explaining why do 
not reject HO does NOT mean that HO is likely true 
were provided. The informal fallacy reasoning is the 
most accurate to the situation we are in as the 
analogy is simply an analogy (not a true reason) and 
formal logic does not entirely apply to this situation 
as hypothesis testing does not deal in certainties. 
The important part to gain from these explanations 
is the idea that do not reject HO is NOT the same 
thing as accepting HO. Further, it is INCORRECT in 
hypothesis testing to state "accept HO". 


What accept HO looks like 
Suppose for the average number of hours of sleep 
example, we concluded "do not reject HO ". The 
correct conclusion would be "there is not sufficient 
evidence to suggest that, on average, people get 
less than 8 hours of sleep". That is, we are saying 
there is not enough evidence to support HA , but 
we are saying nothing about HO. 
It would be INCORRECT to state: "there is 
sufficient evidence to suggest that, on average, 
people are getting at least 8 hours of sleep," as this 
statement is equivalent to accepting HO or 
stating that the null hypothesis is likely true. If you 
wrote your conclusion this way, you would be 
committing the informal fallacy of arguing from 
ignorance. 

We want to test whether the mean GPA of 
students in Canadian universities is different 
from 2.0 (out of 4.0). The null and alternative 
hypotheses are: 

Ho: μ = 2.0 

Ha: μ ≠ 2.0 


Write the conclusion if a) we reject the null 
hypothesis and b) if we do not reject the null 
hypothesis. Finally, write the incorrect 
conclusion of c) accepting the null hypothesis. 


a) There is sufficient evidence to suggest that 
the mean GPA of students in Canadian 
universities is different from 2.0 (out of 4.0). 
b) There is not sufficient evidence to suggest 
that the mean GPA of students in Canadian 
universities is different from 2.0 (out of 4.0). 
c) There is sufficient evidence to suggest that 
the mean GPA of students in Canadian 
universities is 2.0 (out of 4.0). (NOTE: 
REMEMBER THIS IS INCORRECT!!!) 


What if we want to investigate the null 
hypothesis? 


When we are doing research, sometimes we want to 
determine if nothing has changed or if two things 
are the same (i.e. we want to investigate the null 
hypothesis). For example, suppose we want to show 
that feeding babies mushy food or whole food 
makes no difference to instances of choking. 
We can investigate this situation using inferential 
statistics, but it would not be appropriate to do so 
using a hypothesis test. This highlights a limitation 
of null hypothesis significance tests: they do not let 
us investigate situations where we want to show two 
(or more) things are the same. Instead we have to 
use other inferential techniques such as confidence 
intervals. 


Type I and II errors 


As mentioned previously, hypothesis testing never 
results in a 100% certain conclusion about the truth or 
falseness of the alternative hypothesis. Rather, as we 
are using sample data to make a conclusion about a 
population parameter, there is always the possibility 
of error. 


Analogy: Murder trial 

In a murder trial, the jurors do their best job, based 
on the evidence, to arrive at their decision of guilty 
(reject null hypothesis) or not guilty (do not reject 
null hypothesis). But sometimes they make 
mistakes. That is, sometimes their verdict is not 
guilty, when in fact the person is guilty OR their 
verdict is guilty, when in fact the person is innocent. 
The jurors do not intentionally make these errors (we 
hope) but instead the evidence they were presented 
with leads them down the wrong path. Further, after 
the trial is over, they don't know if their decision 
was right or wrong. 


Similarly in hypothesis testing, we can make a 
wrong decision not because we intend to but 
because our sample data (evidence) led us down the 
wrong path. When you perform a hypothesis test, 
there are actually four possible outcomes depending 
on the actual truth (or falseness) of the null 
hypothesis Ho and the decision to reject or not. The 
outcomes are summarized in the following table: 


Statistical decision     Ho is actually true    Ho is actually false 
Do not reject Ho         Correct decision       Type II error 
Reject Ho                Type I error           Correct decision 


The four possible outcomes in the table are: 


1. The decision is do not reject Ho when Ho is 
true (correct decision). 

2. The decision is reject Ho when Ho is true 
(incorrect decision known as a Type I error). 
This case is described as "rejecting a good null" 
or a "false positive". 

3. The decision is do not reject Ho when, in fact, 
Ho is false (incorrect decision known as a Type 
II error). This is called "accepting a false null" 
or a "false negative". 

4. The decision is reject Ho when Ho is false 
(correct decision). 


Each of the errors occurs with a particular 
probability. The Greek letters α and β represent the 
probabilities. 


* α = probability of a Type I error = P(Type I 
error) = probability of rejecting the null 
hypothesis when the null hypothesis is true. 
This is called the level of significance! 

* β = probability of a Type II error = P(Type II 
error) = probability of not rejecting the null 
hypothesis when the null hypothesis is false. 
The probability 1 - β is called the "power of 
the test". 


As α and β represent the probabilities of committing 
errors, they should be as small as possible. 
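
The meaning of α as a long-run error rate can be illustrated with a small simulation. The Python sketch below repeatedly draws samples from a population where the null hypothesis really is true, runs a two-tailed test each time, and counts how often the null is (wrongly) rejected. Everything specific here is an assumption made for illustration: the null value of 8, the known standard deviation of 1, the sample size of 30, the 2000 repetitions, the fixed random seed, and the function name type_one_error_rate.

```python
import random
from math import sqrt
from statistics import NormalDist


def type_one_error_rate(alpha, n, trials, mu0=8.0, sigma=1.0, seed=0):
    """Estimate P(reject Ho | Ho is true) by simulation (hypothetical helper)."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    std_normal = NormalDist()
    rejections = 0
    for _ in range(trials):
        # The null hypothesis is TRUE here: the data really have mean mu0.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = (sample_mean - mu0) / (sigma / sqrt(n))
        # Two-tailed p-value: area in both tails beyond |z|.
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            rejections += 1            # a Type I error
    return rejections / trials


rate = type_one_error_rate(alpha=0.05, n=30, trials=2000)
print(f"Estimated Type I error rate: {rate:.3f}")
```

With a level of significance of 5%, the estimated rate should land close to 0.05: roughly one in twenty tests rejects a null hypothesis that is actually true.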


Relationship between α and β 

The probabilities of committing the two types of error are 
related to each other, and the sample size affects 
both probabilities. 


* For any fixed α, an increase in the sample size 
will cause a decrease in β. 

* For any fixed sample size, a decrease in α will 
cause an increase in β. 

* To decrease both α and β, increase the sample 
size. 

* For a fixed sample size, α and β are inversely 
related to each other. In other words, if you make 
α very small, then β will be very big. 


This last point is very important when you are 
considering your choice for α, the level of 
significance. If you make α very small (say 
0.00001%), then the probability of committing a 
type I error will be very small. But the probability of 
committing a type II error will be very high! This is 
why the level of significance is usually chosen to be 
between 1% and 10%. We don't want it to be 
smaller than 1% as that would cause the probability 
of a type II error to be too high and we don't want it 
larger than 10% as the probability of a type I error 
would be too high. Instead we try to balance the 
probabilities. 
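
The trade-off described above can be made concrete with a short calculation. The Python sketch below computes β for a left-tailed test of a mean with known standard deviation, first for several values of α at a fixed sample size and then for several sample sizes at a fixed α. The null value of 8, the "true" mean of 7.5, the standard deviation of 1, and the function name type_two_error_probability are hypothetical values chosen only for illustration. In practice the true mean is unknown, so β can only be computed for an assumed alternative like this one.

```python
from math import sqrt
from statistics import NormalDist


def type_two_error_probability(alpha, n, mu0=8.0, mu_true=7.5, sigma=1.0):
    """beta = P(do not reject Ho | true mean is mu_true), left-tailed test."""
    std_normal = NormalDist()
    # Reject Ho when the z statistic falls below this critical value.
    z_critical = std_normal.inv_cdf(alpha)
    # When the true mean is mu_true, the z statistic is centred here, not at 0.
    shift = (mu_true - mu0) / (sigma / sqrt(n))
    # beta is the chance that z lands at or above the critical value anyway.
    return 1 - std_normal.cdf(z_critical - shift)


# Fixed sample size (n = 25), shrinking alpha.
betas_by_alpha = {a: type_two_error_probability(a, n=25) for a in (0.10, 0.05, 0.01)}

# Fixed alpha (5%), growing sample size.
betas_by_n = {n: type_two_error_probability(0.05, n) for n in (10, 25, 100)}

for a, beta in betas_by_alpha.items():
    print(f"n = 25, alpha = {a:.2f}: beta = {beta:.3f}")
for n, beta in betas_by_n.items():
    print(f"alpha = 0.05, n = {n}: beta = {beta:.4f}")
```

Under these assumptions, β grows as α shrinks and falls as the sample size grows, which is the pattern listed in the bullet points.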


Considerations when choosing the level of 
significance 


Once you have set out your null and alternative 
hypothesis, you need to determine how strong your 
sample data must be before you would be confident 
in rejecting the null hypothesis in favour of the 
alternative hypothesis. The required strength of 
evidence is defined by the level of significance (α). 


Typical values for alpha range from 1% to 10% 
and will vary depending on a number of factors, 
including conventions set by a particular industry or 
discipline and the relative risks of a Type I versus a 
Type II error. In many cases, the choice of alpha 
may be left up to the analyst. Unfortunately, 
without a peer review process, some analysts may 
be tempted to set alpha in a way that will support 
their desired conclusion. 


For example, if a pharmaceutical company stands to 
make millions of dollars on a new drug, it obviously 
has a vested interest in offering proof that the drug 
is effective. The null hypothesis is that the drug is 
not effective; and the alternative is that it is. But 
what if the proof, as discovered by several rounds of 
double-blind tests, turns out to be rather weak? This 
would normally lead the researcher to decide not to 
reject the null hypothesis and conclude that the 
sample evidence is insufficiently strong for the drug 
to be considered a success. If this were the 
conclusion, the drug should not be approved as an 
effective treatment. But a company with millions 
already invested in the drug may be strongly 
determined to see it to market, in spite of the test 
results. An unethical approach might be to simply 
move the goal posts to make it easier to reject the 
null hypothesis (i.e. to make the evidence needed to 
reject the null hypothesis weaker). 


These goal posts, of course, are defined by the level 
of significance. In much scientific testing, the level 
of significance is typically set at 1%, which means 
the sample evidence must be very strong before a 
null hypothesis can be rejected. But in this example, 
the pharmaceutical company may move "the goal 
posts", which would mean setting the level of 
significance as high as 10%. This higher level of 
significance allows for weaker evidence to be used 
in support of an alternative hypothesis. 


Thankfully, at least when it comes to 
pharmaceutical testing, there are objective, 
government regulated standards that cannot be 
easily manipulated by vested interests. However, 
there are instances where the researcher is in 
control of choosing the level of significance. When 
this is the case, the choice should be made ethically 
and with an honest consideration of the implications 
of Type I and Type II errors. 


As a final note, the level of significance should 
never be chosen after the sample data has been 
collected and summarized. This would be akin to 
allowing the home team to determine where the 
goal posts are after the game has already begun! 


Suppose the null hypothesis is: Frank's rock 
climbing equipment is safe. What would the type I 
and II errors and their associated probabilities be in 
this situation? Further, what would be an 
appropriate choice for the level of significance? 


Solution 


Type I error: Frank thinks that his rock climbing 
equipment may not be safe when, in fact, it really 
is safe. 


Type II error: Frank thinks that his rock climbing 
equipment may be safe when, in fact, it is not safe. 


α = probability that Frank thinks his rock 
climbing equipment may not be safe when, in fact, 
it really is safe. 


β = probability that Frank thinks his rock climbing 
equipment may be safe when, in fact, it is not safe. 


Notice that, in this case, the error with the greater 
consequence is the Type II error. (If Frank thinks 
his rock climbing equipment is safe, he will go 
ahead and use it.) Therefore, we want to minimize 
the probability of a Type II error which, since α and 
β are inversely related, means accepting a larger 
probability of a Type I error. This means that we 
would want to choose the level of significance to be 
α = 10%. 


Try It 


Suppose the null hypothesis, Ho, is: the blood 
cultures contain no traces of pathogen X. 
State the Type I and Type II errors. 


Type I error: The researcher thinks the blood 
cultures do contain traces of pathogen X, 
when in fact, they do not. 


Type II error: The researcher thinks the blood 
cultures do not contain traces of pathogen X, 
when in fact, they do. 


Try It 


Suppose the null hypothesis, Ho, is: The 
victim of an automobile accident is alive 
when he arrives at the emergency room of a 
hospital. State the Type I and Type II errors. 
Further, determine an appropriate level of 
significance. 


Type I error: The emergency crew thinks that 
the victim is dead when, in fact, the victim is 
alive. 


Type II error: The emergency crew does not 
know if the victim is alive when, in fact, the 
victim is dead. 


The error with the greater consequence is the 
Type I error. (If the emergency crew thinks 
the victim is dead, they will not treat him.) 
Therefore, we want to minimize the probability of a 
Type I error and, thus, we choose the level of 
significance to be small, i.e. α = 1%. 


Try It 


It’s a Boy Genetic Labs claim to be able to 
increase the likelihood that a pregnancy will 
result in a boy being born. Statisticians want 
to test the claim. Suppose that the null 
hypothesis, Ho, is: It’s a Boy Genetic Labs has 
no effect on gender outcome. State the Type I 
and Type II errors. Further, determine an 
appropriate level of significance. 


Type I error: This results when a true null 
hypothesis is rejected. In the context of this 
scenario, we would state that we believe that 
It’s a Boy Genetic Labs influences the gender 
outcome, when in fact it has no effect. 


Type II error: This results when we fail to 
reject a false null hypothesis. In context, we 
would state that It’s a Boy Genetic Labs does 
not influence the gender outcome of a 
pregnancy when, in fact, it does. 


The error of greater consequence would be 
the Type I error since couples would use the 
It’s a Boy Genetic Labs product in hopes of 
increasing the chances of having a boy. 
Therefore, we want to minimize the probability of a 
Type I error and, thus, we choose the level of 
significance to be small, i.e. α = 1%. 


Try It 

Determine both Type I and Type II errors for the 
following scenario: 

Assume a null hypothesis, Ho, that states the 
percentage of adults with jobs is at least 88%. 


Identify the Type I and Type II errors from 
these four statements. 


a. Not to reject the null hypothesis that the 
percentage of adults who have jobs is at 
least 88% when that percentage is 
actually less than 88%. 

b. Not to reject the null hypothesis that the 
percentage of adults who have jobs is at 
least 88% when the percentage is 
actually at least 88%. 

c. Reject the null hypothesis that the 
percentage of adults who have jobs is at 
least 88% when the percentage is 
actually at least 88%. 

d. Reject the null hypothesis that the 
percentage of adults who have jobs is at 
least 88% when that percentage is 
actually less than 88%. 


Type I error: c 


Type II error: a 


Homework 


In a population of fish in a lake, 42% used to be 
female. Then an oil spill happened in the lake. 
A test is conducted to see if, in fact, 
the proportion is now less. State the null and 
alternative hypotheses. 


1. Ho: π = 0.42 
2. Ha: π < 0.42 


A random survey of 75 death row inmates in 
the U.S. revealed that the mean length of time 
on death row is 17.4 years with a standard 
deviation of 6.3 years. If you were conducting a 
hypothesis test to determine if the population 
mean time on death row is likely different from 
15 years, what would the null and alternative 
hypotheses be? 


1. Ho: μ = 15 
2. Ha: μ ≠ 15 


Some of the following statements refer to the 
null hypothesis, some to the alternate 
hypothesis. 


State the null hypothesis, Ho, and the 
alternative hypothesis, Ha, in terms of the 
appropriate parameter (μ or π). 


1. The mean number of years Canadians work 
before retiring is 34. 

2. At most 60% of Canadians vote in federal 
elections. 

3. The mean starting salary for McGill 
University graduates is at least $100,000 
per year. 

4. Twenty-nine percent of high school seniors 
get drunk each month. 

5. Fewer than 5% of adults ride the bus to 
work in Calgary. 

6. The mean number of cars a person owns in 
her lifetime is not more than ten. 

7. About half of Canadians prefer to live 
away from cities, given the choice. 

8. Europeans have a mean paid vacation each 
year of six weeks. 

9. The chance of developing breast cancer is 
under 11% for women. 

10. Private universities' mean tuition cost is 
more than $20,000 per year. 


1. Ho: μ = 34; Ha: μ ≠ 34 

2. Ho: π ≤ 0.60; Ha: π > 0.60 

3. Ho: μ ≥ 100,000; Ha: μ < 100,000 

4. Ho: π = 0.29; Ha: π ≠ 0.29 

5. Ho: π ≥ 0.05; Ha: π < 0.05 

6. Ho: μ ≤ 10; Ha: μ > 10 

7. Ho: π = 0.50; Ha: π ≠ 0.50 

8. Ho: μ = 6; Ha: μ ≠ 6 

9. Ho: π ≥ 0.11; Ha: π < 0.11 

10. Ho: μ ≤ 20,000; Ha: μ > 20,000 


For exercises a-d above, write the conclusions if 
a) we reject the null hypothesis, b) we do not 
reject the null hypothesis, and c) we 
INCORRECTLY accept the null hypothesis. 


1. a) There is sufficient evidence to suggest 
that the mean number of years Americans 
work before retiring is different from 34. 

b) There is not sufficient evidence to 
suggest that the mean number of years 
Americans work before retiring is different 
from 34. 

c) There is sufficient evidence to suggest 
that the mean number of years Americans 
work before retiring is 34. (NOTE: 
REMEMBER THIS IS INCORRECT!!!) 

2. a) There is sufficient evidence to suggest 
that more than 60% of Americans vote in 
presidential elections. 

b) There is not sufficient evidence to 
suggest that more than 60% of Americans 
vote in presidential elections. 

c) There is sufficient evidence to suggest 
that at most 60% of Americans vote in 
presidential elections. (NOTE: REMEMBER 
THIS IS INCORRECT!!!) 

3. a) There is sufficient evidence to suggest 
that the percentage of high school seniors 
who get drunk each month is different 
from 29%. 

b) There is not sufficient evidence to 
suggest that the percentage of high school 
seniors who get drunk each month is 
different from 29%. 

c) There is sufficient evidence to suggest 
that twenty-nine percent of high school 
seniors get drunk each month. (NOTE: 
REMEMBER THIS IS INCORRECT!!!) 

4. a) There is sufficient evidence to suggest 
that fewer than 5% of adults ride the bus 
to work in Los Angeles. 

b) There is not sufficient evidence to 
suggest that fewer than 5% of adults ride 
the bus to work in Los Angeles. 

c) There is sufficient evidence to suggest 
that at least 5% of adults ride the bus to 
work in Los Angeles. (NOTE: REMEMBER 
THIS IS INCORRECT!!!) 


A statistics instructor believes that fewer than 
20% of Evergreen Valley College (EVC) students 
attended the opening night midnight showing 
of the latest Harry Potter movie. She surveys 84 
of her students and finds that 11 attended the 
midnight showing. An appropriate alternative 
hypothesis is: 


1 =-0.20 
2.5 = 0.20 
3.40: 0220 
4. π < 0.20 


State the Type I and Type II errors in complete 
sentences given the following statements. 


a. The mean number of years Americans 
work before retiring is 34. 

b. At most 60% of Americans vote in 
presidential elections. 

c. The mean starting salary for San Jose State 
University graduates is at least $100,000 
per year. 

d. Twenty-nine percent of high school seniors 
get drunk each month. 

e. Fewer than 5% of adults ride the bus to 
work in Los Angeles. 

f. The mean number of cars a person owns in 
his or her lifetime is not more than ten. 

g. About half of Americans prefer to live 
away from cities, given the choice. 

h. Europeans have a mean paid vacation each 
year of six weeks. 

i. The chance of developing breast cancer is 
under 11% for women. 


a. Type I error: We conclude that the mean is 
not 34 years, when it really is 34 years. 
Type II error: We do not conclude that the 
mean is not 34 years, when in fact it really 
is not 34 years. 

b. Type I error: We conclude that more than 
60% of Americans vote in presidential 
elections, when the actual percentage is at 
most 60%. Type II error: We do not 
conclude that more than 60% of Americans 
vote in presidential elections when, in fact, 
more than 60% do. 

c. Type I error: We conclude that the mean 
starting salary is less than $100,000, when 
it really is at least $100,000. Type II error: 
We do not conclude that the mean starting 
salary is less than $100,000 when, in fact, 
it is less than $100,000. 

d. Type I error: We conclude that the 
proportion of high school seniors who get 
drunk each month is not 29%, when it 
really is 29%. Type II error: We do not 
conclude that the proportion of high 
school seniors who get drunk each month 
is not 29% when, in fact, it is not 29%. 

e. Type I error: We conclude that fewer than 
5% of adults ride the bus to work in Los 
Angeles, when the percentage that do is 
really 5% or more. Type II error: We do 
not conclude that fewer than 5% of adults 
ride the bus to work in Los Angeles when, 
in fact, fewer than 5% do. 

f. Type I error: We conclude that the mean 
number of cars a person owns in his or her 
lifetime is more than 10, when in reality it 
is not more than 10. Type II error: We do 
not conclude that the mean number of cars 
a person owns in his or her lifetime is 
more than 10 when, in fact, it is more than 
10. 

g. Type I error: We conclude that the 
proportion of Americans who prefer to live 
away from cities is not about half, though 
the actual proportion is about half. Type II 
error: We do not conclude that the 
proportion of Americans who prefer to live 
away from cities is not about half when, in 
fact, it is not half. 

h. Type I error: We conclude that the 
duration of paid vacations each year for 
Europeans is not six weeks, when in fact it 
is six weeks. Type II error: We do not 
conclude that the duration of paid 
vacations each year for Europeans is not 
six weeks when, in fact, it is not. 

i. Type I error: We conclude that the 
proportion is less than 11%, when it is 
really at least 11%. Type II error: We do 
not conclude that the proportion is less 
than 11%, when in fact it is less than 11%. 


For statements a-i in the exercise above, 
determine whether one error is of more 
consequence than the other or if both are 
equally bad. 


Note: There can be more than one answer to 
these questions as the consequence often 
depends on the context. 


a. Both errors are equally bad. 

b. Both errors are equally bad. 

c. A type I error would be bad for school 
recruiters, while a type II error would be 
bad for current students. 

d. Both errors are equally bad. 

e. If you are campaigning for less bus 
funding, then a type I error would be bad. 

f. Both errors are equally bad. 

g. Both errors are equally bad. 

h. Both errors are equally bad. 

i. A type I error would be bad as it could 
result in a decrease in funding or screening 
when there shouldn't be. 


When a new drug is created, the 
pharmaceutical company must subject it to 
testing before receiving the necessary 
permission from the Food and Drug 
Administration (FDA) to market the drug. 
Suppose the null hypothesis is “the drug is 
unsafe.” What is the Type II Error? 


1. To conclude the drug is safe when, in fact, 
it is unsafe. 

2. Not to conclude the drug is safe when, in 
fact, it is safe. 

3. To conclude the drug is safe when, in fact, 
it is safe. 

4. Not to conclude the drug is unsafe when, 
in fact, it is unsafe. 


Eight-Step Hypothesis Test - C Lemieux 
In this class, we will learn three hypothesis tests: 


* One population mean test when population 
standard deviation known 

* One population mean test when population 
standard deviation unknown 

* One population proportion test 


Regardless of the test we are doing, you are required 
to follow an eight-step procedure for doing the test: 


1. State HO and HA. This includes stating what the 
parameter represents. 

2. Summarize the sample data. For a means test, 
your evidence will consist of the sample mean, the 
standard deviation (which may be the sample or 
population standard deviation), and the sample size. 
For a proportions test, your evidence will consist of 
the sample proportion and the sample size. 

3. State and justify the model (or distribution) being 
used. Determine what model or distribution you use 
(e.g. standard normal distribution or binomial 
distribution). This will include justifying the choice 
by referring to the conditions needed to use the 
model or distribution. 

4. Choose an appropriate level of significance. 
Consider the implications of a Type I vs. a Type II 
error in choosing your level of significance, as well 
as any ethical considerations. 

5. Calculate the test statistic and related p-value. 
Using the summaries of the sample data and Excel, 
compute the test statistic and the associated p-value. 

6. Discuss what the p-value measures in context. 
This involves interpreting what the p-value is a 
probability of in the context of the question. It does 
not involve making a decision (that comes next). 

7. Make a decision. Compare the p-value to the 
required strength of evidence (alpha) and determine 
if you can reject or fail to reject the null hypothesis. 

8. Offer a concluding sentence. Using accessible 
language, summarize your conclusion in sentence 
form within the context of the problem and 
referring to the alternative hypothesis. 
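
As a rough illustration of the calculation and decision steps (computing the test statistic and p-value, then comparing the p-value to alpha), here is a minimal Python sketch for a one population mean test with known population standard deviation. The course itself uses Excel, so Python is only a stand-in. The function names z_test and decide, the tail labels "left", "right" and "two", and the example numbers (sample mean 7.6, null value 8, population standard deviation 1.2, sample size 36, alpha of 5%) are all assumptions made up for this sketch. It also assumes the sampling distribution of the sample mean is normal.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, tail):
    """Return (z, p_value) for a one-mean test with known sigma (hypothetical helper)."""
    # Test statistic: how many standard errors the sample mean is from mu0.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if tail == "left":       # HA: mu < mu0
        p_value = std_normal.cdf(z)
    elif tail == "right":    # HA: mu > mu0
        p_value = 1 - std_normal.cdf(z)
    elif tail == "two":      # HA: mu is different from mu0
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("tail must be 'left', 'right' or 'two'")
    return z, p_value


def decide(p_value, alpha):
    """Compare the p-value to the level of significance."""
    return "reject Ho" if p_value < alpha else "do not reject Ho"


# Assumed example: do people sleep less than 8 hours on average?
z, p_value = z_test(sample_mean=7.6, mu0=8, sigma=1.2, n=36, tail="left")
decision = decide(p_value, alpha=0.05)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, decision: {decision}")
```

The sketch deliberately stops at the decision. Interpreting the p-value in context and writing the concluding sentence still have to be done in words, and "do not reject Ho" must never be reported as "accept Ho".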


One population means test when population 
standard deviation known - C Lemieux 


In this section, we will learn how to do a hypothesis 
test for one population mean when the population 
standard deviation is known. 


In general, a one population mean test is done when 
we are investigating only one random variable and 
the data is quantitative. Further, we are examining a 
situation where we are concerned about the 
measure of center of the data. 


Further, in this class, we are only examining the 
situation where the sampling distribution of sample 
means can be assumed to be normally distributed. 
The situation where it is not normally distributed is 
beyond the scope of this course. 


To learn the one population means test when the 
population standard deviation is known, we will 
examine our eight-step hypothesis test. 


The central limit theorem for the sampling 
distributions of sample means is important for 


doing Step 3 of the hypothesis test. You may want 
to review it before continuing. 


Eight-step hypothesis test for one- 
population mean test when population 
standard deviation is known 


State H0 and HA. As this is a one-population 
means test, our hypotheses will include the symbol 
μ. There are three versions of the alternative 
hypothesis and their opposing null hypotheses: 


* HA: μ > some number (μ0) vs. H0: μ ≤ μ0 
* HA: μ < some number (μ0) vs. H0: μ ≥ μ0 
* HA: μ ≠ some number (μ0) vs. H0: μ = μ0 


For ease, we often write the null hypothesis simply 
as μ = μ0, as we only care about the equality for the 
null. This results in the following: 


* HA: μ > μ0 vs. H0: μ = μ0 (right-tailed test) 
* HA: μ < μ0 vs. H0: μ = μ0 (left-tailed test) 
* HA: μ ≠ μ0 vs. H0: μ = μ0 (two-tailed test) 


Notice how "some number" (i.e. μ0) is the same 
for both HA and H0. This is because they are 
opposites of each other. Therefore, the only thing 
that changes is the binary sign between them. 


Summarize the sample data. For a one population 
means test when the population standard deviation 
is known, your evidence will consist of the sample 
mean, the population standard deviation, and the 
sample size. 


The population standard deviation σ can only be 
known if it is provided to you in the question. 
Remember that a hypothesis test is always done 
using sample data. Therefore, you can NOT find 
the population standard deviation by finding the 
standard deviation of the sample. To repeat: the 
population standard deviation is only known if 
it is clearly given to you in the question. For 
example, if the question states "the population 
standard deviation is 3" then you know the 
population standard deviation. But if the question 
only says "the sample standard deviation is 3" or 
you are only given the sample data, then the 
population standard deviation is NOT known. We 
will discuss what to do in this situation in the next 
section. 


State and justify the model (or distribution) 
being used. As we are doing a hypothesis test for 
one population mean when the population standard 
deviation is known, we will be using the standard 
normal distribution as our model. To use this model, 
we need to determine two things: 


* Is the population standard deviation known? If 
it is clearly stated in the question (i.e. given to 
you), then the answer is yes. If it is not clearly 
given to you, then read the next section to find 
out what to do. 

* Is the sampling distribution of sample means 
normally distributed? To determine this, we 
use the central limit theorem: 


    * Is the sample size greater than 30? If yes, 
    then the sampling distribution will be 
    approximately normal regardless of the 
    shape of the parent population. 

    * If the sample size is 30 or less, then we 
    need to determine if the population 
    distribution is normal. If the 
    population distribution is normal, then the 
    sampling distribution of sample means is 
    normal regardless of the sample size. 
    Either the shape of the population 
    distribution will be stated in the question 
    or you can determine it by examining a 
    normal probability plot. 

    * If the sample size is 30 or less and the 
    population distribution is not normal, you 
    do NOT have the knowledge to perform 
    this hypothesis test. 
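
The model check above can be sketched as a small decision helper. This is only an illustration of the rule of thumb described in this section; the function names are hypothetical and are not part of the course materials.

```python
def sampling_distribution_is_normal(n: int, population_is_normal: bool) -> bool:
    """Rule of thumb from the text for the sampling distribution of sample means."""
    if n > 30:
        # Central limit theorem: approximately normal whatever the population shape.
        return True
    # Small sample: normal only if the population itself is normal.
    return population_is_normal


def choose_model(n: int, sigma_known: bool, population_is_normal: bool) -> str:
    """Name the model to use, or flag a case this course does not cover."""
    if not sampling_distribution_is_normal(n, population_is_normal):
        return "not covered in this course"
    if sigma_known:
        return "standard normal"
    return "see the next section (population standard deviation unknown)"


print(choose_model(n=40, sigma_known=True, population_is_normal=False))
print(choose_model(n=9, sigma_known=True, population_is_normal=True))
print(choose_model(n=9, sigma_known=True, population_is_normal=False))
```

The first two calls return "standard normal", one because the sample is large and one because the population is normal. The third call describes a small sample from a non-normal population, which this course does not cover.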


Choose an appropriate level of significance. 
Consider the implications of a Type I vs. a Type II 
error in choosing your level of significance, as well 
as any ethical considerations. 


Calculate the test statistic and related p-value. 
The test statistic for this situation is 
$z^* = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$. 
The p-value is then found by finding the probability in the tail of 
the standard normal distribution where the test 
statistic is the boundary of the tail. 


* If it is a right-tailed test, find P(Z > z*) 

* If it is a left-tailed test, find P(Z < z*) 

* If it is a two-tailed test, find P(Z > |z*|) 
times 2. 
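
As a cross-check on the Excel approach, the test statistic and the three tail rules can be sketched in Python with the standard library. The function name `one_sample_z_test` and the numbers in the usage line are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, tail):
    """Return (z*, p-value) for a one population mean test with sigma known.

    tail must be 'right', 'left', or 'two'.
    """
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "right":
        p_value = 1 - std_normal.cdf(z)            # P(Z > z*)
    elif tail == "left":
        p_value = std_normal.cdf(z)                # P(Z < z*)
    elif tail == "two":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # 2 * P(Z > |z*|)
    else:
        raise ValueError("tail must be 'right', 'left', or 'two'")
    return z, p_value


# Hypothetical inputs, not taken from the text.
z, p = one_sample_z_test(sample_mean=52.0, mu0=50.0, sigma=6.0, n=36, tail="two")
print(round(z, 2), round(p, 4))  # 2.0 0.0455
```

`NormalDist().cdf(z)` plays the same role as Excel's `NORM.S.DIST(z, 1)`: both return the area to the left of z under the standard normal curve.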


Discuss what the p-value measures in context. 
Your sentence will look something like this: The 
probability of observing a sample mean [insert 
appropriate inequality based on tail of test] [insert 
sample mean, including relevant units], assuming 
[insert null hypothesis] is true, is [insert p-value]. 


Make a decision. 


* Do not reject H0 if the p-value is greater than or 
equal to α 
* Reject H0 if the p-value is less than α 


Offer a concluding sentence. Using accessible 
language summarize your conclusion in sentence 
form within the context of the problem and 
referring to the alternative hypothesis. 


Suppose Irene, who owns a top bakery in the city, 
claims that she has the best bread in the city by 
any measure. Not only is her bread the tastiest, it is 
also the fluffiest and the tallest, averaging 15 cm in 
height. Another baker, Jose, wishes to challenge 
Irene’s claim that her bread is the tallest. As 
evidence he will provide a sample of 40 randomly 
selected loaves of bread and have their heights 
measured in his attempt to prove that his bread 
heights actually exceed 15 cm, on average. In 
doing so, he obtains a sample mean bread height of 
15.5 cm. He also knows from baking thousands of 
loaves that his variation is very low: specifically 
the population standard deviation is 0.9 cm. 


* State H0 and HA. We define μ in this 
problem to be the population mean height 
of Jose's bread. 

HA: μ > 15 

H0: μ = 15 

Note: This is a right-tailed test. 

* Summarize the sample data. The sample size 
is 40, the sample mean is 15.5 cm, and the 
population standard deviation is 0.9 cm. 

* State and justify the model (or 
distribution) being used. As the population 
standard deviation is known and this is a one 
population means test, we want to use the 
standard normal distribution. To do this, we 
need to ensure that the sampling distribution 
of sample means is normal. As the sample size 
is greater than 30, we can assume the 
sampling distribution of sample means is 
approximately normal. Therefore it is 
appropriate to model this situation using 
the standard normal distribution. 

* Choose an appropriate level of 
significance. The required strength of 
evidence can now be determined by 
considering the implications of a Type I vs. a 
Type II error. In this context, Jose will make a 
Type I error if he concludes that his bread 
heights average more than 15 cm when in fact 
they do not. He will make a Type II error if he 
concludes that there is not enough evidence 
that his bread heights average more than 15 
cm when in fact they do. Which error is worse 
will depend on where you are standing. Jose 
would consider a Type II error worse, whilst 
Irene would consider a Type I error worse. To 
be fair, we will choose a level of significance 
of 5%, which is generally considered a good 
balance between the two types of errors. This 
means we will reject H0 if the p-value < 
0.05. 

* Calculate the test statistic and related 
p-value. The test statistic is found as follows: z* 
= (15.5 - 15)/(0.9/√40) = 3.51. As it is a 
right-tailed test, we find P(Z > 3.51) using the 
Excel function "=1-NORM.S.DIST(3.51,1)". 
This gives us p = 0.0002. 

* Discuss what the p-value measures in 
context. Under the assumption that Jose’s 
bread is no taller than Irene’s (that is, his bread 
averages only 15 cm), the probability of 
obtaining a sample of 40 loaves with a mean 
height of 15.5 cm (or more) is only 0.0002 or 0.02%. 

* Make a decision. Since p (0.0002) < 0.05, 
we reject H0. 

* Offer a concluding sentence. Therefore we 
can conclude that the evidence suggests that 
Jose’s bread averages more than 15 cm and is, 
on average, taller than Irene’s. 
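
The arithmetic in the bread example can be reproduced outside Excel. The sketch below uses Python's standard library instead of `NORM.S.DIST`; the variable names are illustrative only. Small differences in the last decimal place are possible because the code does not round z* before finding the p-value.

```python
from math import sqrt
from statistics import NormalDist

# Values taken from the bread example above.
n = 40              # sample size
sample_mean = 15.5  # cm
sigma = 0.9         # population standard deviation, cm
mu0 = 15.0          # value of the mean under H0, cm
alpha = 0.05        # level of significance

z = (sample_mean - mu0) / (sigma / sqrt(n))
p_value = 1 - NormalDist().cdf(z)  # right-tailed test: P(Z > z*)
decision = "reject H0" if p_value < alpha else "do not reject H0"

print(f"z* = {z:.2f}, p = {p_value:.4f}, decision: {decision}")
```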


Charter Air claims that its new executive boarding 
service has improved the time it takes for business 
passengers to purchase tickets, store luggage and 
board the plane. They believe that it is now less than the 
previous time of 12 minutes. A sample of 9 
customers of this new exclusive service indicates 
that the mean time to complete these three tasks is 
9.3 minutes, with a population standard deviation 
of 3.32 minutes. Previous studies have revealed 
that boarding times tend to follow a normal 
distribution. 


* State H0 and HA. We define μ in this 
problem to be the population mean time to 
complete these three tasks with the new 
executive boarding service. 

HA: μ < 12 

H0: μ = 12 

Note: This is a left-tailed test. 

* Summarize the sample data. The sample size 
is 9, the sample mean is 9.3 minutes, and the 
population standard deviation is 3.32 minutes. 

* State and justify the model (or 
distribution) being used. As the population 
standard deviation is known and this is a one 
population means test, we want to use the 
standard normal distribution. To do this, we 
need to ensure that the sampling distribution 
of sample means is normal. As we are told that 
it is likely that the population distribution of 
time to complete these three tasks is normally 
distributed, we can assume the sampling 
distribution of sample means is approximately 
normal. Therefore it is appropriate to 
model this situation using the standard normal 
distribution. 

* Choose an appropriate level of 
significance. A Type I error in this case 
would be for Charter Air to claim their 
boarding time is less than 12 minutes, when in 
fact it is not. A Type II error in this case would 
be for Charter Air not to claim their boarding 
time is less than 12 minutes, when in fact it is. 
A Type I error could lead to false advertising, 
which has both ethical and legal implications, 
so it would be best to minimize the likelihood 
of making this type of error. Alpha should be 
set at 1% (or at most 5%). Reject H0 if the 
p-value < 0.01. 

* Calculate the test statistic and related 
p-value. The test statistic is found as follows: z* 
= (9.3 - 12)/(3.32/√9) = -2.44. As it is a 
left-tailed test, we find P(Z < -2.44) using the 
Excel function "=NORM.S.DIST(-2.44,1)". This 
gives us p = 0.0073. 

* Discuss what the p-value measures in 
context. The probability of getting a sample 
mean of 9.3 minutes (or less), with an 
assumed population mean of 12 minutes for 
completing these three boarding tasks, is 
0.73%. 

* Make a decision. Since p (0.0073) < 0.01, 
we reject H0. 

* Offer a concluding sentence. There is 
sufficient evidence to indicate that the mean 
time to complete these three tasks with the 
new executive boarding service has improved 
to be less than 12 minutes. 


One population means test when population 
standard deviation unknown - C Lemieux 


While working through the previous section you 
may have had a question: 


If we don't know the population mean, 
how do we know the population standard 
deviation? 


That's a really good question. The actual formula for 
the population standard deviation involves knowing 
the population mean: 
$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$, 
where N is the population size. Therefore, if we 
don't know the population mean, how do we know 
the population standard deviation? 


There are two possible answers to this: 


1. In some long running process (e.g. 
manufacturing), the standard deviation may be 
very static. Therefore, the population standard 
deviation could be known even if the 
population mean isn't. 

2. We don't know the population standard 
deviation, so instead we estimate it with the 
sample standard deviation. 


In most situations, it is fairly unlikely that the 
population standard deviation will be known. Thus, 
we will focus on situations where the population 
standard deviation is unknown. In that case, we will 
use the sample standard deviation s to estimate the 
population standard deviation σ. 
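
To make the distinction concrete, the sketch below computes both quantities for one small data set using Python's `statistics` module. The data values are hypothetical. `pstdev` treats the values as an entire population and divides by the number of values, while `stdev` treats them as a sample and divides by one less than the number of values.

```python
from statistics import pstdev, stdev

# Hypothetical data values, used only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sigma = pstdev(data)  # population formula: divide by N
s = stdev(data)       # sample formula: divide by n - 1

print(sigma)          # 2.0
print(round(s, 3))    # 2.138
```

The sample standard deviation comes out slightly larger than the population formula applied to the same numbers, because of the smaller divisor.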


Student-t Distribution 


To use this model to conduct a hypothesis test, we 
need to again assume that the sampling distribution 
of sample means is normal, and that the sample was 
collected randomly and is representative of the 
population. Just as we saw in the previous section, 
there are two general situations that need to occur 
to ensure the sampling distribution is normal: 


* If the sample size is greater than 30, then the 
central limit theorem tells us that we can 
assume that the sampling distribution is 
approximately normal regardless of the 
population distribution. Thus, if the sample size 
is greater than 30, we can use this model. 

* If the sample size is 30 or less, the central 
limit theorem does not guarantee that the 
sampling distribution of the means will be 
normal. Therefore, to use this model the 
population distribution needs to be 
approximately normal so that we know that 
the sampling distribution for sample means is 
normal. 


Since we don't know the population standard 
deviation, we will be using the sample standard 
deviation to estimate σ. That means we will be 
performing the hypothesis test using the sample 
mean and sample standard deviation. This suggests 
that there may be more error in our test as we are 
using two statistics rather than just one. To account 
for the greater error, we want the p-value to be 
slightly bigger. That is, we want there to be more 
evidence before we reject the null hypothesis. 


When the population standard deviation is known, 
the test statistic is found by calculating 
$z^* = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$. 
Since the population standard deviation is now 
unknown, we find the test statistic by calculating 
$t^* = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$. 


To find the p-value, we need to use a slightly 
different distribution than the standard normal 
distribution. In particular, we want to use a 
distribution that has a greater standard deviation 
than the standard normal distribution. This will 
make the distribution slightly wider, which results in 
the p-value being slightly bigger compared to the 
standard normal distribution. The distribution we 
will use is called the Student-t distribution. It has 
the same mean as the standard normal distribution 
(0), but its standard deviation is larger than that of the 
standard normal (i.e. greater than 1). To illustrate, 
consider the image below: 


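[Figure 1: Difference between p-value for standard normal distribution vs. Student-t distribution. Curves shown: Student's t and standard normal.] 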


Some points about Figure 1: 


* Notice how both distributions are centred at 0. 
That is, they both have a mean of 0. 

* Notice how the Student-t distribution is wider 
than the standard normal distribution. This is 
because the standard normal distribution has a 
standard deviation of 1, while the Student-t 
distribution has a standard deviation of greater 
than 1. 

* The p-value for the standard normal 
distribution is shown as the area highlighted in 
red, while the p-value for the Student-t 
distribution is highlighted by the blue lines. 
Notice that for the same test statistic, the 
standard normal distribution has a SMALLER 
p-value compared to the Student-t distribution. 

* This means that for the same test statistic it is 
HARDER to reject the null hypothesis for the 
Student-t distribution than it is for the standard 
normal distribution. 

* That is, the Student-t distribution accounts for 
the greater error in the hypothesis test, which 
comes from estimating the standard deviation, by 
having larger p-values compared to the 
standard normal distribution. 
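
The comparison in these points can be checked numerically. The sketch below is one possible way to do it with only the Python standard library: it approximates the right-tail area of the Student-t distribution by integrating its density with Simpson's rule, which is an approach chosen here for illustration and is not how Excel computes it. The helper names and the test statistic of 2.0 are hypothetical.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist


def t_pdf(x, df):
    """Density of the Student-t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_right_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # The distribution is symmetric about 0, so P(T > 0) = 0.5.
    return 0.5 - total * h / 3


test_statistic = 2.0  # hypothetical value, same for every model below
normal_p = 1 - NormalDist().cdf(test_statistic)
print(f"standard normal: p = {normal_p:.4f}")
for df in (5, 10, 30):
    print(f"Student-t, df = {df}: p = {t_right_tail(test_statistic, df):.4f}")
```

For the same test statistic, every Student-t p-value printed is larger than the standard normal one, and the gap shrinks as the degrees of freedom increase.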


Information about the Student-t distribution 


* The Student-t distribution is a symmetric, 
bell-shaped distribution with μ = 0 and σ > 1. It 
resembles a normal distribution but is not one. The standard 
deviation of the Student-t distribution is 
different for different sample sizes. Remember 
that the standard normal distribution is a 
normal distribution with μ = 0 and σ = 1. 
Therefore, the Student-t distribution is centred 
at the same place as the standard normal 
distribution, but has greater variation so it is 
slightly wider and shorter. 

* The smaller the sample size, the greater the 
variability is in the sampling distribution. 
When the sample size is larger, there is less 
variability in the sampling distribution. These 
aspects are reflected in shape of the Student-t 
distribution. 

* As the sample size n gets larger, the Student-t 
distribution gets closer to the standard normal 
distribution. 


The standard deviation of the Student-t distribution 
is based on the degrees of freedom, which in turn 
are based on the sample size. The number of degrees 
of freedom for a sample corresponds to the number 
of data values that can vary after certain restrictions 
have been imposed on all data values. Another way 
of saying this is that the degrees of freedom are the 
number of components that need to be known 
before a statistic is entirely determined. Depending 
on the model used, the degrees of freedom have a 
different formula. For this model (i.e. one 
population mean when population standard 
deviation is unknown), the degrees of freedom are 
the sample size minus 1, i.e. n-1. 
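
The link between sample size, degrees of freedom, and spread can be illustrated with a short calculation. The text does not give a formula for the standard deviation of the Student-t distribution; the sketch below assumes the standard result that it equals the square root of df/(df - 2), which only applies when df is greater than 2. The function names are hypothetical.

```python
from math import sqrt


def degrees_of_freedom(n):
    """Degrees of freedom for a one population mean test with sigma unknown."""
    return n - 1


def t_standard_deviation(df):
    """Standard deviation of the Student-t distribution (assumes df > 2)."""
    if df <= 2:
        raise ValueError("the standard deviation is only finite for df > 2")
    return sqrt(df / (df - 2))


for n in (5, 10, 31, 101):
    df = degrees_of_freedom(n)
    print(f"n = {n:3d}, df = {df:3d}, sd = {t_standard_deviation(df):.3f}")
```

The printed standard deviations fall from about 1.414 at n = 5 toward 1 as n grows, which matches the statement that the Student-t distribution gets closer to the standard normal distribution for larger samples.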


Eight-step hypothesis test for one 
population mean when population 
standard deviation is unknown 


Compare these steps to when the population 
standard deviation is known and you will find a lot 
of overlap. Notice when they are DIFFERENT! 


State H0 and HA. As this is a one-population 
means test, our hypotheses will include the symbol 
μ. There are three versions of the alternative 
hypothesis and their opposing null hypotheses: 


* HA: μ > some number (μ0) vs. H0: μ ≤ μ0 
* HA: μ < some number (μ0) vs. H0: μ ≥ μ0 
* HA: μ ≠ some number (μ0) vs. H0: μ = μ0 


For ease, we often write the null hypothesis simply 
as μ = μ0, as we only care about the equality for the 
null. This results in the following: 


* HA: μ > μ0 vs. H0: μ = μ0 (right-tailed test) 
* HA: μ < μ0 vs. H0: μ = μ0 (left-tailed test) 
* HA: μ ≠ μ0 vs. H0: μ = μ0 (two-tailed test) 


Notice how "some number" (i.e. μ0) is the same 
for both HA and H0. This is because they are 
opposites of each other. Therefore, the only thing 
that changes is the binary sign between them. 


Summarize the sample data. For a one population 
means test when the population standard deviation 
is unknown, your evidence will consist of the 
sample mean, the sample standard deviation, and 
the sample size. 


State and justify the model (or 
distribution) being used. As we are doing a 
hypothesis test for one population mean when the 
population standard deviation is unknown, we will 
be using the Student-t distribution as our model. To 
use this model, we need to determine two things: 


* Is the population standard deviation unknown? 
If the population standard deviation is not 
clearly given to you, then the population 
standard deviation is unknown and we should 
use the Student-t distribution. 

* Is the sampling distribution of sample means 
normally distributed? To determine this, we 
use the central limit theorem: 


    * Is the sample size greater than 30? If yes, 
    then the sampling distribution will be 
    approximately normal regardless of the 
    shape of the parent population. 

    * If the sample size is 30 or less, then we 
    need to determine if the population 
    distribution is normal. If the 
    population distribution is normal, then the 
    sampling distribution of sample means is 
    normal regardless of the sample size. 
    Either the shape of the population 
    distribution will be stated in the question 
    or you can determine it by examining a 
    normal probability plot. 

    * If the sample size is 30 or less and the 
    population distribution is not normal, you 
    do NOT have the knowledge to perform 
    this hypothesis test. 


Choose an appropriate level of significance. 
Consider the implications of a Type I vs. a Type II 
error in choosing your level of significance, as well 
as any ethical considerations. 


Calculate the test statistic and related p-value. 
The test statistic for this situation is 
$t^* = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$. 
The p-value is then 
found by finding the probability in the tail of the 
Student-t distribution where the test statistic is the 
boundary of the tail. 


* If it is a right-tailed test, find P(T > t*) 

* If it is a left-tailed test, find P(T < t*) 

* If it is a two-tailed test, find P(T > |t*|) times 
2. 


Discuss what the p-value measures in context. 
Your sentence will look something like this: The 
probability of observing a sample mean [insert 
appropriate inequality based on tail of test] [insert 
sample mean, including relevant units], assuming 
[insert null hypothesis] is true, is [insert p-value]. 


Make a decision. 


* Do not reject H0 if the p-value is greater than or 
equal to α 
* Reject H0 if the p-value is less than α 


Offer a concluding sentence. Using accessible 
language summarize your conclusion in sentence 
form within the context of the problem and 
referring to the alternative hypothesis. 


Smurf Box Company has been running into some 
problems lately. In particular, the machine that 
produces cardboard boxes needs to be replaced. 
The purchaser at Smurf is trying to choose between 
two machines. Both machines produce good quality 
boxes, but she wants to determine if there is a 
difference in the average number of boxes 
produced per hour. To investigate, she randomly 
selected eight of the forty machine operators in 
the shop and, after training on the new machines, 
had each operator produce as many boxes as they 
could in a one-hour period on each machine. She 
recorded the number of boxes produced by each 
operator in each one-hour period. Then, for each 
operator, she compared the number of boxes 
produced per machine by subtracting the number of 
boxes produced on Machine 2 (M2) from the number 
of boxes produced on Machine 1 (M1); that is, she 
found M1 - M2. The results are in the table below. 


Machine Operator 


1. Assuming that the population of differences is 


normally distributed, perform a hypothesis 
test to determine whether there is a difference 
between the mean number of boxes produced 
per hour by the two machines. 

2. Is there a level of significance that would 
cause you to change your decision? If so, what 
is it? If not, why not? 

3. If an error has been committed, what type is 
it? What would the error be? 


Solution 


1. State H0 and HA. We define μ in this 
problem to be the population mean difference 
in the number of boxes produced by both 
machines. 

HA: μ ≠ 0 

H0: μ = 0 

Note: This is a two-tailed test. 

Summarize the sample data. The sample size is 8, 
the sample mean is 3.25 boxes per hour, and the 
sample standard deviation is 1.488 boxes per hour. 

State and justify the model (or distribution) 
being used. As the population standard 
deviation is unknown and this is a one 
population means test, we want to use the 
Student-t distribution. To do this, we need to 
ensure that the sampling distribution of 
sample means is normal. As we are told that it 
is likely that the population distribution of the 
differences is normally distributed, we can 
assume the sampling distribution of sample 
means is approximately normal. Therefore it is 
appropriate to model this situation using 
the Student-t distribution. 

Choose an appropriate level of significance. As the 
purchase of a new machine is a substantial 
investment, the purchaser would want strong 
evidence that there is a difference in the 
output of the machines. Therefore, she would 
want to set the level of significance at 1% (the 
level of significance that requires the strongest 
evidence to reject the null). This means that we 
reject H0 if the p-value < 0.01. 

Calculate the test statistic and related p-value. 
The test statistic is found as follows: t* = (3.25 - 0)/ 
(1.488/√8) = 6.177. As it is a two-tailed test, 
we find 2*P(T > 6.177) using the Excel 
function "=TDIST(6.177,7,2)" [TDIST(test 
stat, degrees of freedom, tail of test)]. This 
gives us p = 0.00046. 

Discuss what the p-value measures in context. 
The probability of getting a sample mean 
difference of at least 3.25 boxes per hour 
(times 2, to account for both tails), with an 
assumed population mean of 0 
for the difference in output of the two 
machines, is 0.046%. 

Make a decision. Since p (0.00046) < 0.01, we 
reject H0. 

Offer a concluding sentence. There is sufficient 
evidence to indicate that there is a difference 
between the mean number of boxes produced 
per hour by the two machines. 


2. Levels of significance are usually between 1% 
and 10%. As the p-value is clearly less than 
1%, there is no level of significance that 
would cause us to change our mind. 

3. If an error is committed, it would be a Type I 
error. In this case, the error would be the 
purchaser concluding that there is a mean 
difference in the number of boxes produced by 
the machines per hour, when in fact there is 
not. 
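
The calculation in part 1 can be reproduced from the summary statistics alone. The sketch below is an illustration using only the Python standard library: because the standard library has no Student-t distribution, the tail area is approximated by integrating the density with Simpson's rule, which is an assumption of this sketch and not the method Excel's TDIST uses. The helper names are hypothetical, and results may differ from the text in the last decimal place because of rounding.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of the Student-t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_right_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


# Summary statistics from the box machine example.
n = 8             # operators sampled
mean_diff = 3.25  # sample mean of M1 - M2
s = 1.488         # sample standard deviation of the differences
mu0 = 0.0         # mean difference under H0
alpha = 0.01

df = n - 1
t_star = (mean_diff - mu0) / (s / sqrt(n))
p_value = 2 * t_right_tail(abs(t_star), df)  # two-tailed test
decision = "reject H0" if p_value < alpha else "do not reject H0"

print(f"t* = {t_star:.3f}, df = {df}, p = {p_value:.5f}, decision: {decision}")
```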


One population proportions test - C Lemieux 


In this section, we will learn how to do a hypothesis 
test for one population proportion. 


In general, a one population proportion test is done 
when we are investigating only one random variable 
and the data is categorical. Further, we are 
examining a situation where we are concerned 
about the proportion, rate or percentage of the data 
that fit within one category. 


Further, in this class, we are only examining the 
situation where the sampling distribution of sample 
proportions can be assumed to be normally 
distributed. The situation where it is not normally 
distributed is beyond the scope of this course. 


To learn the one population proportion test, we will 
examine our eight-step hypothesis test. 


The central limit theorem for the sampling 
distributions of sample proportions is important for 


doing Step 3 of the hypothesis test. You may want 
to review it before continuing. 


Eight-step hypothesis test for proportion 


The process of doing hypothesis tests becomes 
repetitive as you see more and more of them. 
Having said that, there are differences between how 
the tests are performed. Students often get confused 
between what needs to be done for each type of test 
because there are so many similarities. Therefore, 
compare the steps of the three tests we've learned 
(i.e. one population means test when population 
standard deviation is known, one population means 
test when population standard deviation is 
unknown, and one population proportions test) to 
see both where they are SIMILAR and where they 
are DIFFERENT. 


State H0 and HA. As this is a one-population 
proportions test, our hypotheses will include the 
symbol π. There are three versions of the 
alternative hypothesis and their opposing null 
hypotheses: 


* HA: π > some number (π0) vs. H0: π ≤ π0 
* HA: π < some number (π0) vs. H0: π ≥ π0 
* HA: π ≠ some number (π0) vs. H0: π = π0 


For ease, we often write the null hypothesis simply 
as π = π0, as we only care about the equality for the 
null. This results in the following: 


* HA: π > π0 vs. H0: π = π0 (right-tailed test) 
* HA: π < π0 vs. H0: π = π0 (left-tailed test) 
* HA: π ≠ π0 vs. H0: π = π0 (two-tailed test) 


Notice how "some number" (i.e. π0) is the same 
for both HA and H0. This is because they are 
opposites of each other. Therefore, the only thing 
that changes is the binary sign between them. 


Summarize the sample data. For a one population 
proportions test, your evidence will consist of the 
number of successes or the sample proportion 
(either will be fine) and the sample size. 


State and justify the model (or distribution) 
being used. As we are doing a hypothesis test for 
one population proportion, we will be using the 
standard normal distribution as our model. To use 
this model, we need to determine two things: 


* Does the situation follow a binomial 
distribution? To check this simply verify that 
there are only two options (success and 
failure). Though there are other conditions for 
the binomial distribution, this is by far the 
most important so make sure it is met. 

* Is the sampling distribution of sample 
proportions normally distributed? To determine 
this, we use the central limit theorem, which 
states that we can assume the sampling 
distribution of proportions is normal if the number 
of successes (nπ) and failures (n(1 - π)) are 
both at least five. Use the evidence (i.e. the 
actual counted number of successes) to 
determine if this is the case. 


Choose an appropriate level of significance. 
Consider the implications of a Type I vs. a Type II 
error in choosing your level of significance, as well 
as any ethical considerations. 


Calculate the test statistic and related p-value. 
The test statistic for this situation is 
$z^* = \dfrac{\hat{p} - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}}$, 
where $\hat{p}$ is the sample proportion. The 
p-value is then found by finding the probability in the 
tail of the standard normal distribution where the 
test statistic is the boundary of the tail. 


* If it is a right-tailed test, find P(Z > z*) 

* If it is a left-tailed test, find P(Z < z*) 

* If it is a two-tailed test, find P(Z > |z*|) 
times 2. 


Discuss what the p-value measures in context. 
Your sentence will look something like this: The 
probability of observing a sample proportion 
[insert appropriate inequality based on tail of test] 
[insert sample proportion], assuming [insert null 
hypothesis] is true, is [insert p-value]. 


Make a decision. 


* Do not reject H0 if the p-value is greater than or 
equal to α 
* Reject H0 if the p-value is less than α 


Offer a concluding sentence. Using accessible 
language summarize your conclusion in sentence 
form within the context of the problem and 


referring to the alternative hypothesis. 


A charitable organization wanted to see if a new 
form of mail marketing would change the 
percentage of people who replied. In the past the 
percentage of people who would reply to mail 
marketing was 1 in 175. A sample of 2000 letters 
was sent out. A total of 20 people responded. Is 
there any significant change in the percentage of 
people who reply? 

State H0 and HA. We define π in this problem to 
be the population proportion of respondents to the 
new mail marketing campaign. 

HA: π ≠ 1/175 = 0.0057 

H0: π = 0.0057 

Note: This is a two-tailed test. 

Summarize the sample data. The sample size is 
2000, the number of successes is 20 and the sample 
proportion is 20/2000 = 0.01. 

State and justify the model (or distribution) 
being used. As this is a one 
population proportions test, we want to use the 
standard normal distribution. To do this, we need 
to ensure that 1) the situation meets the conditions 
of the binomial distribution and 2) the sampling 
distribution of sample proportions is normal. 


* The only condition we need to check for the 
binomial distribution is whether there are only 
two options. In this case, either a recipient 
replies to the letter or does not. 
Therefore, this situation meets this condition. 

* The sampling distribution of sample 
proportions can be assumed to be normal as 
the number of successes (20) and the number 
of failures (2000 - 20 = 1980) are both at least 
5. 


Choose an appropriate level of significance. The 
Type I error in this situation is that the charity 
determines that there is a change, when in fact there 
is not. A Type II error is that the charity does 
not have enough evidence to determine there is a 
change, when in fact there has been one. Most 
charitable organizations rely on fundraising as a 
main source of income, so missing a real change 
would be costly. Therefore, it would be 
better to risk making a Type I error 
than a Type II error, and we will set alpha at 
10%. This means that we reject H0 if the p-value 
< 0.10. 

Calculate the test statistic and related 
p-value. The test statistic is found as follows: z* = 
(0.01 - 0.0057)/sqrt((0.0057*0.9943)/2000) = 
2.54. As it is a two-tailed test, we find 2*P(Z > 
2.54) using the Excel function 
"=2*(1-NORM.S.DIST(2.54,1))". This gives us p = 0.0111. 

Discuss what the p-value measures in context. 
The probability of getting a sample proportion of at 
least 1% (times 2, to account for both tails), with an 
assumed population proportion of 0.57% for the 
response rate, is 1.11%. 

Make a decision. Since p (0.0111) < 
0.10, we reject H0. 

Offer a concluding sentence. 
There is sufficient evidence to indicate that there is 
a significant change in the percentage of 
respondents with the new form of mail marketing. 
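
The mail marketing calculation can be checked in Python with the standard library. This is a sketch with illustrative variable names; it uses the exact value 1/175 for the null proportion, so the p-value may differ from the Excel result in the fourth decimal place because z* is not rounded first.

```python
from math import sqrt
from statistics import NormalDist

# Values from the mail marketing example.
n = 2000        # letters sent
successes = 20  # replies received
pi0 = 1 / 175   # response rate under H0
alpha = 0.10

p_hat = successes / n
# Normality condition from the text: successes and failures both at least 5.
normal_ok = successes >= 5 and (n - successes) >= 5

z = (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed test
decision = "reject H0" if p_value < alpha else "do not reject H0"

print(f"normal model appropriate: {normal_ok}")
print(f"z* = {z:.2f}, p = {p_value:.4f}, decision: {decision}")
```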


Introduction to confidence intervals - MRU - C 
Lemieux 
Introduction to collection on confidence intervals 


From Chapter 6, we know that if we take many 
samples of the same size from a population and 
calculate the sample means, the sample means will 
be clustered around the population mean, but many 
of them won't be exactly the same as the population 
mean. Therefore, we can estimate the population 
mean using a sample mean, but we expect there to 
be a certain amount of error in that estimate. To 
determine that error, we can look at the standard 
error. That is, we can look at the amount of 
variation between the sample means. 


In this chapter, we will use this information about
how sample means behave to help us make 
estimates about the population mean of unknown 
populations. We will also do this with sample 
proportions and population proportions. That is, the 
goal of this chapter is to make inferences about the 
population from sample data. This is our first foray 
into inferential statistics. 


By the end of this section, the student should be 
able to 


* Find and interpret confidence intervals that
estimate the population mean and the
population proportion.

* Understand the properties of the Student-t
distribution.

* For confidence intervals for the population
mean, determine whether to use the
Student-t distribution or the standard normal
distribution as a model.

* Find the minimum sample size needed to
estimate a parameter given a margin of error.


What are confidence intervals? - MRU - C Lemieux 
Explanation of what confidence intervals are. 


Suppose you are trying to determine the mean rent 
of a two-bedroom apartment in your town. You 
might look in the classified section of the 
newspaper, write down several rents listed, and 
average them together. This provides a point 
estimate of the true mean. If you are trying to 
determine the percentage of times you make a 
basket when shooting a basketball, you might count 
the number of shots you make and divide that by 
the number of shots you attempted. In this case, you 
would have obtained a point estimate for the true 
proportion. 


A point estimate is a single value used to estimate 
a population parameter. For example, the sample 
mean is a point estimate of the population mean. 
But point estimates do not give a sense of how much 
error there is in an estimate. Thus, we instead want 
to provide an interval estimate for the population
parameter that takes error into account. The type of
interval estimate we will learn about in this chapter 
is called a confidence interval. 


From our work on sampling distributions, we know 
that the sample mean probably won't be exactly the 
population mean. Instead we expect it to be slightly 
larger or smaller than the population mean. But by
how much? The margin of error, denoted E,
measures how much we expect the statistic to vary
from the parameter. The margin of error is 
computed by looking at how much variation is in 
the sampling distribution and the level of 
confidence (discussed below). 


To calculate a confidence interval, you take the 
statistic and you add and subtract the margin of 
error from it. For example, if you are trying to 
estimate the population mean, you would take the 
sample mean and add and subtract the margin of 
error from it: $(\bar{x} - E, \bar{x} + E)$. This gives an interval of
values that you expect the population mean to fall 
between. 


A recent opinion poll asked Canadians their
opinion of the work of the current Prime Minister
of Canada. 53% of those polled approved of his work
with a margin of error of 2.6%. The statistic is a
sample proportion of 53% and we are trying to
estimate the true proportion of Canadians who
approve of the Prime Minister's work. We know
that there will be error in that estimate and it has
been measured to be 2.6%. Therefore, we are
estimating that the true proportion of all Canadians
who approve of the Prime Minister's work is
53% ± 2.6%, that is, between 50.4% and 55.6%.
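
The arithmetic in the poll example is simply "statistic minus E" and "statistic plus E". A minimal sketch in Python, using the poll's numbers; the function name `confidence_interval` is only an illustrative choice:

```python
def confidence_interval(point_estimate, margin_of_error):
    """Return (lower, upper) as point estimate minus/plus the margin of error."""
    return point_estimate - margin_of_error, point_estimate + margin_of_error

# Poll example: sample proportion of 53% with a margin of error of 2.6%.
lower, upper = confidence_interval(53.0, 2.6)
print(f"{lower:.1f}% to {upper:.1f}%")
```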


Confidence intervals change depending on
the sample, but the parameter being estimated is
fixed. For example, on a specific day, the
population mean rent of a two-bedroom apartment
in your town is a specific value. You are trying to
estimate it, but it is fixed. The confidence interval,
on the other hand, changes depending on the
sample you take. Suppose instead of looking at the
classified section of a newspaper, you looked at a
rental website. Then the sample might be different,
which will result in a different confidence interval.
Or suppose you stood outside a mall entrance and
asked every fifth person what they paid in rent for
their two-bedroom apartment. Then your sample
would be different, which will result in a different
confidence interval. These three different
confidence intervals are all estimating the same
thing, the population mean rent of a two-bedroom
apartment in your town, but since each of the
samples is different, the sample means will be
different, which will result in different estimates. In
short, the parameter being estimated is not a
random variable. But the confidence interval being
used to estimate the parameter varies depending on
the random sample taken.


In the following sections, we will learn how to 
calculate the margin of error for the mean and 
proportion. For each situation, we will use a
different model to find the margin of error. It should
be noted that all of the models are based on the
assumption that a random sample has been
collected. Therefore, finding a confidence interval
based on the convenience sample of the rent in 
today's classified ads is not appropriate. This is 
important to remember when you are critically 
assessing a confidence interval provided to you. No 
matter how prettily the confidence interval is 
presented, if it was constructed from a non-random 
sample, it is useless. It is like baking an apple pie 
from rotten apples. It might look good, but it is still 
rotten. 

100 confidence intervals generated from 100
random samples of the rent of two-bedroom
apartments in your town. Source: Online Statistics
Education: A Multimedia Course of Study
(http://onlinestatbook.com/). Project Leader: David M.
Lane, Rice University.


Why is it called a confidence interval? 


If you are trying to estimate how much it will cost 
to go on a trip to Montreal for five days, you can 
work out with strong confidence the cost of the 
flight and hotels, but then you have to start making 
estimates about how much food and entertainment 
will cost while you're there. You can get a pretty 
good estimate of what it will cost, but your friend
whom you are trying to convince to come with you
might want to know how confident you are in that
estimate. Are you the kind of person who just
guesses at the cost of meals or did you look at
restaurants' menus to come up with a sense of
what meals cost in Montreal? Did you take into
account snacks? The cost of renting a car or taking 
the bus? Did you assume you were going to do an 
equal number of free and paid admission activities? 
All of this affects the confidence you have in your 
estimate. 


For a confidence interval, it is much easier to 
determine how much confidence we have in our 
estimate because confidence intervals come with a 
level of confidence (or confidence level). 


To understand the confidence level, let's go back to 
the two-bedroom apartment situation. Let's now 
suppose that 100 people on the same day were very 
curious about determining the mean rent for two- 
bedroom apartments in your town. Each of these 
100 people went out and found their own random 
sample of fifty people who rent two-bedroom 
apartments in your town. From these 100 samples, 
100 confidence intervals were calculated. Based on
our work on sampling distributions, we know that
the 100 sample means will be close to the
population mean (some might even be the same as
the population mean), but some will be closer and
some will be farther. Thus some of the confidence
intervals will be 'good' estimates of the population
mean rent for two-bedroom apartments (that is, the
population mean will actually be included in the
confidence interval) and some will be 'bad' estimates
(that is, the population mean won't actually be
included in the confidence interval). Since the
population mean is unknown, none of the 100
people who made these confidence intervals knows 
if their estimate is good or bad. Instead, they can 
only state how confident they are in their estimate. 
That is, they can only state their level of confidence. 


Suppose that all 100 people made 95% confidence 
intervals. What does that mean? Well suppose a 
local real estate company has actually worked out 
the population mean rent for two-bedroom 
apartments in your town by finding out the rent for 
all two-bedroom apartments. Since they know the 
population mean, they don't have to estimate it. 
They have found it to be $1200. 


[link] shows the 100 confidence intervals created by 
the 100 random samples and compares them to the 
population mean. If the interval is yellow then that 
means it is a good estimate. If it is red, then that 
means it is a bad estimate. The yellow part in the 
middle represents the 95% confidence interval. The
yellow and the blue combined represent the 99% 
confidence interval. 


The above image was created using an applet from
David Lane's onlinestatbook.com [footnote]


Notice that out of the 100 confidence intervals 
calculated, 93 of them are good estimates (contain 
$1200) and seven of them are bad estimates (do not 
contain $1200). This is what the confidence level 
refers to. That is, if you take many, many random 
samples of the same size and construct a confidence 
interval for each of the samples, then the percentage 
of confidence intervals that contain the population 
mean is 95% and the percentage that do not contain 
the population mean is 5%. Thus, the confidence 
level refers to the probability that the process of 
creating a confidence interval results in the 
population parameter being in the confidence 
interval. It is NOT the probability that the 
population mean falls in a specific confidence 
interval. Remember that the population mean is 
fixed. Therefore, either the population mean does 
fall in the confidence interval or it doesn't. Since 
there is no randomness to whether it does fall or 
not, there is no probability associated with that 
event. Instead the level of confidence refers to the 
percent of confidence intervals that contain the 
parameter being estimated if the study/experiment 
is repeated many, many times. 
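
The "repeat the study many, many times" idea can be imitated with a short simulation. The sketch below is only an illustration under stated assumptions: rents are assumed to be normally distributed around the $1200 mean with a hypothetical population standard deviation of $300, that standard deviation is treated as known, and each interval is built with the standard normal model described later in the chapter. The function name `coverage_rate` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage_rate(mu, sigma, n, level, repetitions, seed=0):
    """Fraction of simulated confidence intervals that contain mu."""
    rng = random.Random(seed)
    # Critical value for the chosen confidence level (1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(repetitions):
        # Draw one random sample of size n and compute its mean.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        sample_mean = fmean(sample)
        # A "good" interval is one that contains the population mean.
        if sample_mean - margin <= mu <= sample_mean + margin:
            hits += 1
    return hits / repetitions

# Hypothetical population: mean rent $1200, standard deviation $300.
rate = coverage_rate(mu=1200, sigma=300, n=50, level=0.95, repetitions=10_000)
print(f"Share of intervals containing the population mean: {rate:.3f}")
```

With many repetitions the printed share should settle near 0.95, while any single run of only 100 intervals can easily give 93 or 96 good estimates.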


What has been described above is not an easy idea.
Many people who have studied statistics are under
the false impression that the confidence level refers
to the probability that the parameter is in the
confidence interval. Don't fret if this doesn't make
complete sense to you right away. Give yourself some
time to think about it and process it.


As a note, the example provided in [link] may seem a bit
surprising. If you flip a fair coin 100 times, you
would expect around 50 heads and 50 tails, but
due to sampling variability it would also be reasonable to
get 49 heads and 51 tails. It is the same thing with
confidence intervals: we expect that for 100
confidence intervals around 95 of them contain
the population mean and 5 of them don't, but it
would be reasonable to get 94 good estimates and 6 bad
ones. Once again, the law of large numbers tells us
that as the number of samples increases, the closer we will
get to 95%. That is, if we take 1000 random
samples instead of 100, it is more likely that close to
95% will be good estimates and close to 5% will be bad.

Comparing different levels of confidence for the
same random sample


Common choices for confidence levels 


The most common choices for confidence levels are
90%, 95%, and 99%, but you can choose the level of
confidence to be any percentage between 0.00001%
and 99.99999%. You can't choose 100%, because
that would mean you know for sure that the
population parameter falls within the confidence
interval. You also can't choose 0%, because that
would mean you know for sure that the population
parameter does not fall within the confidence
interval. If you knew for sure the parameter falls (or
does not fall) in the confidence interval, you
wouldn't be bothering to do a confidence interval,
because you would already know that parameter.


90%, 95%, and 99% are common levels of 
confidence because they offer a high degree of 
confidence. 


How does the confidence level change the 
confidence interval? Think about the following two 
confidence intervals for the mean age of students at 
your university: 


4 years old to 85 years old 
20 years old to 21 years old 


Which confidence interval are you more confident 
actually contains the population mean? Well it is 
pretty likely that the population mean age of 
students at your university is somewhere between 4 
years old and 85 years old, because the range is so 
wide that it most likely "catches" the population
mean. 


In general, the larger the confidence level, the wider
the confidence interval. That is, to increase the
confidence in the estimate, we make the confidence
interval wider so that it is more likely to catch what
we are estimating. Think about the confidence
interval like a net. The smaller the net, the less
likely it is you'll catch the fish. But the wider the
net, the more likely it is that you will. Thus for the
same sample, the 90% confidence interval is
narrower than the 99% confidence interval.
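
To see the net getting wider, the sketch below builds 90%, 95%, and 99% intervals from one and the same sample. It is an illustration only: the sample mean (20.5 years) and standard error (0.4 years) are hypothetical, and the intervals use the standard normal model that is developed later in this chapter.

```python
from statistics import NormalDist

def interval_for_level(sample_mean, standard_error, level):
    """Interval sample_mean +/- z * standard_error for a confidence level."""
    # z leaves (1 - level) / 2 of the area in each tail.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: mean age 20.5 years, standard error 0.4 years.
for level in (0.90, 0.95, 0.99):
    lower, upper = interval_for_level(20.5, 0.4, level)
    print(f"{level:.0%}: {lower:.2f} to {upper:.2f} (width {upper - lower:.2f})")
```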


Thus, a 99% confidence interval is very reliable, but
it gains reliability at the price of precision. That is,
its width may come at the expense of usefulness.
Going back to the confidence interval for the mean
age of students at your university, we can be very
confident that the population mean age is between 4
and 85 years old, but that doesn't actually help us
understand what the population mean age is. We
are less confident in the estimate of 20 to 21 years
old, but it provides us with more useful information.


To summarize, higher degrees of confidence mean
that we are more sure that the parameter falls in the
interval (i.e. more reliable). Lower degrees of
confidence mean that the interval is narrower and
thus gives us a better idea of where the parameter in
question is (i.e. more precise). See [link]


[Figure: 99%, 98%, 95%, and 90% confidence intervals
for the same sample, centred on the sample mean.]


The choice of a 95% level of confidence is most 
common because it provides a good balance 
between precision and reliability. 


What else affects the width of a
confidence interval?


The width of the confidence interval is determined
by the margin of error, E. In general, the confidence
interval is calculated as follows:


(point estimate - E, point estimate + E)


The size of the margin of error determines the width
of the confidence interval. That is, the bigger the
margin of error is, the wider the confidence interval.


Factors that affect the size of the confidence interval
include the size of the sample, the amount of
variability in the data, and the confidence level.


As per the law of large numbers, the larger the 
sample size, the closer the statistic (or point 
estimate) is to the parameter. Therefore, the larger 
the sample size, the less error there is between the 
statistic and the parameter. This means that the 
margin of error is smaller for larger sample sizes 
taken from the same population. 


The greater the variability in the population, the 
greater the variability in the statistics. We saw this 
in Chapter 6 when we determined that the standard 
deviation of the sampling distribution was related 
both to the standard deviation of the population and 
the sample size. That is, the variation between the 
statistics relied both on the variation in the 
population and the sample size. Thus, the margin 
of error is larger in situations where there is 
more variability in the population. 


As stated above, the larger the confidence level, the 
wider the confidence interval. Therefore, the 
margin of error is larger for larger levels of 
confidence. 
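
The three factors can be compared side by side. The following is a sketch under assumptions: it uses the margin of error formula for the mean with a known population standard deviation (introduced later in the chapter), and the numbers (standard deviation 10 or 20, sample size 100 or 400) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sigma, n, level):
    """Margin of error: critical value times sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return z * sigma / sqrt(n)

# Hypothetical baseline: sigma = 10, n = 100, 95% confidence.
baseline = margin_of_error(sigma=10, n=100, level=0.95)
larger_sample = margin_of_error(sigma=10, n=400, level=0.95)   # bigger n
more_variable = margin_of_error(sigma=20, n=100, level=0.95)   # bigger sigma
higher_level = margin_of_error(sigma=10, n=100, level=0.99)    # bigger level

print(f"baseline:         {baseline:.3f}")
print(f"larger sample:    {larger_sample:.3f}")
print(f"more variability: {more_variable:.3f}")
print(f"higher level:     {higher_level:.3f}")
```

Note that quadrupling the sample size only halves the margin of error, because the sample size enters through its square root.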


Common misconceptions about 
confidence intervals 


1. The confidence interval contains 95% of the 
data values. A confidence interval is an 
estimate for a parameter (like the population 
mean or population proportion). Though the 
data values are used to construct the 
confidence interval, the confidence interval 
does not tell us anything about the range of the 
data values. 

2. We are 95% confident that the sample mean 
is contained in the confidence interval. If 
the confidence interval is for the population 
mean, then the sample mean has to be in the 
confidence interval. In fact, it is right in the 
middle. Remember that the confidence interval 
for the population mean is calculated as 
follows: $(\bar{x} - E, \bar{x} + E)$. All confidence intervals
contain the point estimate being used to 
construct the confidence interval. 

3. Increasing the sample size increases the 
width of the confidence interval. In fact, the 
opposite happens. From the law of large 
numbers, we know that a larger sample size 
means that the point estimate will likely be 
closer to the parameter being estimated. 
Therefore, as the sample size increases, the 
margin of error decreases and the width of the 
confidence interval decreases. 

4. A 90% confidence interval is wider than a
95% for the same data. Again, it is the
opposite that happens. To become more 
confident in our estimate (i.e. increasing the 
level of confidence), we widen the confidence 
interval. A wider confidence interval is a larger 
net which makes it more likely that we catch 
the parameter we are estimating. 


Basic premise of constructing a confidence interval - 
MRU - C Lemieux 
Overview of how to construct a confidence interval 


In the above section, we discussed at length what a 
confidence interval is. Now we are going to discuss 
how to construct and interpret one. 


A confidence interval is constructed by taking the 
point estimate and adding and subtracting the 
margin of error. The margin of error is constructed 
by looking at the level of confidence and the 
amount of variation between the point estimates. 
For example, the margin of error for a confidence 
interval for a population mean is found by looking 
at the level of confidence (which the researcher 
determines) and the amount of variation between 
the sample means. The amount of variation between 
the samples means is the amount of variation in the 
sampling distribution for sample means, i.e. the 
standard error. Thus a confidence interval is 
always constructed from the appropriate 
sampling distribution. 


This is helpful in two ways: 


* From our work in Chapter 6, we know what the
standard error is for both the sample mean,
$\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}}$, and the sample
proportion, $\sigma_{\hat{p}} = \sqrt{\frac{\pi(1-\pi)}{n}}$.

* From our work in Chapter 6, we know what the
shape of the sampling distribution will be from
the Central Limit Theorem.


The margin of error is found by taking into 
account the confidence level and the standard 
error. 
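
As a quick reminder of the two standard error formulas, here is a minimal sketch in Python. The function names are illustrative, and the inputs are values that appear elsewhere in this chapter's examples (a population standard deviation of 5.12 with n = 175, and an assumed population proportion of 0.0057 with n = 2000).

```python
from math import sqrt

def standard_error_mean(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)

def standard_error_proportion(proportion, n):
    """Standard error of the sample proportion: sqrt(p * (1 - p) / n)."""
    return sqrt(proportion * (1 - proportion) / n)

se_mean = standard_error_mean(sigma=5.12, n=175)
se_proportion = standard_error_proportion(proportion=0.0057, n=2000)
print(f"standard error of the mean:       {se_mean:.4f}")
print(f"standard error of the proportion: {se_proportion:.6f}")
```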


The next section examines how the margin of error 
is constructed for confidence intervals for the mean. 


Confidence interval for the mean - MRU - C Lemieux 
(2019) 

Explanation of how to construct a confidence 
interval for the mean 


There are multiple models for finding the 
confidence interval for the mean. The models we 
will be looking at rely on the sampling distribution 
being approximately normal. If that is not the case, 
then we cannot use these models. 


Therefore, the following section relies on the 
following assumptions: 


* The sampling distribution for sample means of
the population we are investigating is
approximately normally distributed.

  * If the sample size is greater than 30, then
  the central limit theorem tells us that we
  can assume that the sampling distribution
  is approximately normal regardless of the
  population distribution. Thus, if the
  sample size is greater than 30, we can use
  this model.

  * If the sample size is less than 30, the
  central limit theorem does not guarantee
  that the sampling distribution of the
  means will be normal. Therefore, to use
  this model the population distribution
  needs to be approximately normal so
  that we know that the sampling
  distribution for sample means is normal.

* The sample we are using to construct the
confidence interval is a random sample.


To construct a confidence interval for the mean, 
collect a random sample from the population whose 
mean is being estimated. Then calculate the sample 
mean. 


The next step is to calculate the margin of error. To 
do this, we begin by finding out how much sampling 
variability there is in the sampling distribution. That 
is, we determine how much variation we expect 
between the sample means. This is found by 
calculating the standard error of the sampling 
distribution for sample means: 

$\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}}$


Now we want to take into account the level of
confidence. To do this, we construct a normal
distribution that is centred at the sample mean, $\bar{x}$,
whose standard deviation is the standard error of
the mean, $\frac{\sigma_x}{\sqrt{n}}$. The data values for this distribution
are sample means. Therefore this is a sampling
distribution for sample means. This sampling
distribution is an estimate of what the sampling
distribution of the population will look like:

Blue curve: True sampling distribution for sample
means centred at $\mu_x$ and with a standard deviation
of $\frac{\sigma_x}{\sqrt{n}}$. Red curve: Estimate of the true sampling
distribution for sample means based on the mean of
the random sample. It is centred at $\bar{x}$ and has a
standard deviation of $\frac{\sigma_x}{\sqrt{n}}$.


In [link], the blue sampling distribution is the 
theoretical sampling distribution of the population, 
which is unknown. The red sampling distribution is 
an estimate of the blue curve based on the sample 
mean found from the random sample. We will use 
the red sampling distribution to estimate the 
population mean. 


Using the red sampling distribution, we want to 
determine the interval of sample means that fall 
within a specific percentage from the mean. The 
specific percentage is the confidence level. 


Suppose that the confidence level is 95.44%. From
the empirical rule, we know that 95.44% of data
values fall within 2 standard deviations of the mean
for normally distributed data. Therefore, if we
wanted to construct a 95.44% confidence interval,
we would take the sample mean and add and
subtract two standard deviations from it. Since we
are dealing with a sampling distribution, the
standard deviation we are referring to is the
standard error of the mean. Therefore, a 95.44%
confidence interval is found by calculating
$\bar{x} \pm 2\sigma_{\bar{x}} = \bar{x} \pm 2 \cdot \frac{\sigma_x}{\sqrt{n}}$.
Thus for a 95.44% confidence
interval, the margin of error is $E = 2 \cdot \frac{\sigma_x}{\sqrt{n}}$.

[Figure: 95.44% confidence interval for the mean.
Labels: standard deviation = standard error of the
mean; 95.44% confidence interval.]
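
The "within 2 standard deviations" figure from the empirical rule can be checked directly. A minimal sketch using Python's standard library; the area comes out at roughly 0.9545, which the text quotes as 95.44%:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area under the standard normal curve between z = -2 and z = +2.
area_within_two = standard_normal.cdf(2) - standard_normal.cdf(-2)
print(f"Area within 2 standard deviations: {area_within_two:.4f}")
```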


If we wanted to find a 95% confidence interval, we 
would use the same process, but we would want a 
slightly narrower interval. Therefore, instead of 
multiplying the standard error by 2, we would 
multiply it by a slightly smaller number. To 
determine by what number, we would need to find 
out how many standard deviations away from the 
mean results in an area of 95%. In other words, we 
would need to find the z-score that gives an area of 
95%. 


Standard normal curve with the area of the tails 
being 5%. 


If the area in the middle of the curve is 95%, then 
the area of one tail is 2.5%. Using a computer 
program, we can find this value to be ±1.96.


To do this, go to your computer program and go to 
the menu option that lets you find probabilities for 
normal distributions. Then make the mean 0 and the 
standard deviation 1. Then switch from calculating 
probabilities to finding z-values (like you are going 
to find a percentile). In the appropriate box, put
0.025 in for the area in the upper tail. When you
hit enter, the program will give you 1.96 as the z- 
value for this area. 
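
The same lookup can be done in code instead of through a menu. The sketch below is one possible way, using Python's standard library rather than the unnamed computer program referred to above: an upper-tail area of 0.025 corresponds to a cumulative area of 0.975.

```python
from statistics import NormalDist

upper_tail_area = 0.025  # half of the 5% that lies outside the middle 95%

# inv_cdf takes the area to the LEFT of the z-value, so use 1 - 0.025.
z = NormalDist(mu=0, sigma=1).inv_cdf(1 - upper_tail_area)
print(f"z-value for an upper-tail area of {upper_tail_area}: {z:.2f}")
```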


In general, the value that you multiply the standard
error by is called the critical value and is denoted
by $z_{\alpha/2}$, where $\alpha$ is the total area of the tails.
$(1-\alpha) \times 100\%$ is the level of confidence.


The margin of error is $E = z_{\alpha/2} \times \frac{\sigma_x}{\sqrt{n}}$.


The confidence interval is $\bar{x} \pm E$. As it is an interval,
always write it with the smaller number first ($\bar{x} - E$)
followed by the larger number ($\bar{x} + E$).


Suppose that a random sample of 175 students 
from a university is taken and their average age 
is 21.34 years old and the population standard 
deviation is known to be 5.12 years. 


1. Find the 95% confidence interval for the 
population mean age of all university 
students. 

2. Interpret the confidence interval in the 
context of the question. 

3. Explain what the level of confidence means 
in the context of the problem. 

4. If we decreased the sample size to 100, 
what would you expect to happen to the 
confidence interval? Explain your answer. 

5. Suppose that an administrator at the 
university claims that this university caters 
to older students and that the mean age is 
23. Does the confidence interval support 
the claim? 


1. We can use the standard normal model to
find the confidence interval, because the
sample was collected randomly and, since
the sample size is greater than 30 (it is
175), we can be very confident that the
sampling distribution for the sample means
is normal due to the central limit theorem.
To find the confidence interval, use a
computer program. Make sure to choose
the z-model (instead of the t-model). Input
the sample size as 175, the sample mean as
21.34 and the standard deviation as 5.12.
Choose the level of confidence to be 95%.
From this, we can see that the confidence
interval for the mean is 20.58 to 22.10.

2. To interpret the confidence interval, we
would say that we are 95% confident that 
the population mean age of students from 
this university is somewhere between 
20.58 years old and 22.10 years old. That 
is, we are estimating that the population 
mean age is somewhere between 20.58 
years old and 22.10 years old. 

3. The confidence level means that if we took
many random samples of size 175 from the
student body of this university and
constructed a confidence interval for
each of these random samples, then about 95% of
these confidence intervals would contain the
population mean age for this university,
while about 5% would not.

4. If the sample size is decreased to 100, we 
would expect that the confidence interval 
would get wider. From the law of large 
numbers, we know there is more sampling 
variability in smaller samples. Thus there 
is more potential for error between the 
sample mean and the population mean 
when the sample size is smaller. The 
margin of error then is bigger to take this 
into account. This is supported by the
formula for the margin of error
($z_{\alpha/2} \times \frac{\sigma_x}{\sqrt{n}}$). Since we are dividing by
$\sqrt{n}$, the margin of error would be smaller for
larger n and bigger for smaller n.

5. We have estimated that the population 
mean age is between 20.58 years old and 
22.10 years old. Therefore, based on our 
estimate, it is unlikely that the mean age of 
this university is 23 years old as 23 does 
not fall within our estimate. The 
administrator's claim is most likely 
incorrect. 
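
The answers above can be reproduced without a menu-driven program. The following is a minimal sketch in Python using the z-model and the numbers from the example; the function name `z_interval` is hypothetical and is not tied to any particular statistics package.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(sample_mean, sigma, n, level=0.95):
    """Confidence interval for the mean when sigma is known (z-model)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Part 1: n = 175, sample mean 21.34, population standard deviation 5.12.
lower, upper = z_interval(21.34, 5.12, 175)
print(f"n = 175: {lower:.2f} to {upper:.2f}")

# Part 4: the same sample mean with a smaller sample gives a wider interval.
small_lower, small_upper = z_interval(21.34, 5.12, 100)
print(f"n = 100: {small_lower:.2f} to {small_upper:.2f}")

# Part 5: check whether the administrator's claimed mean of 23 is inside.
print("23 inside the n = 175 interval:", lower <= 23 <= upper)
```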


A few notes about the above confidence interval: 


* All of the means in the interval are equally
likely. That is, each of the estimates of the
population mean in the interval has an equal
chance of being correct. For example, 20.58
years old and 21.25 years old are both equally
likely estimates of the population mean age.

* The sample mean of 21.34 is right in the
middle of the interval.

* The margin of error is 0.759 and is found using
the formula
$z_{\alpha/2} \times \frac{\sigma_x}{\sqrt{n}} = 1.96 \times \frac{5.12}{\sqrt{175}}$.

* It is possible that the population mean is not
captured by this confidence interval, but we
wouldn't know whether it is or not without
knowing the population mean.


Wait a second! If we don't know the population
mean ($\mu_x$), how do we know the
population standard deviation ($\sigma_x$) in the
standard error formula?


That's a really good question. The actual formula for
the population standard deviation involves knowing
the population mean: $\sigma_x = \sqrt{\frac{\sum (x - \mu_x)^2}{n}}$. Therefore, if we
don't know the population mean, how do we know
the population standard deviation?


There are two possible answers to this: 


1. In some long running process (e.g. 
manufacturing), the standard deviation may be 
very static. Therefore, the population standard 
deviation could be known even if the 
population mean isn't. 

2. We don't know the population standard 
deviation, so instead we estimate it with the 
sample standard deviation. 


It is fairly unlikely that in most situations, the 
population standard deviation will be known. Thus, 
we will focus on situations where the population 
standard deviation is unknown. In that case, we will
use the sample standard deviation s to estimate the
population standard deviation $\sigma_x$.
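
In practice, replacing the population standard deviation with s just means computing the standard deviation from the sample itself. A minimal sketch in Python; the eight ages below are made-up data used only for illustration:

```python
from math import sqrt
from statistics import fmean, stdev

# Hypothetical random sample of student ages (illustrative data only).
ages = [19, 22, 20, 25, 21, 23, 18, 24]

sample_mean = fmean(ages)
# stdev() is the sample standard deviation s (it divides by n - 1).
s = stdev(ages)
# Estimated standard error of the mean, with s standing in for sigma.
estimated_standard_error = s / sqrt(len(ages))

print(f"sample mean = {sample_mean:.2f}")
print(f"s = {s:.3f}")
print(f"estimated standard error = {estimated_standard_error:.3f}")
```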

The Student-t distribution was created by William
Gosset, an English statistician who worked for
Guinness breweries. While working for Guinness,
Gosset developed the Student-t distribution, but was
prohibited from publishing his work by his
employers, who worried about trade secrets getting
out. Thus he published his work under the
pseudonym "Student" in 1908. The distribution,
then, should really be called the Gosset-t
distribution.

Comparison of Student-t distribution
with standard normal distribution

Critical value for
Student-t distribution with n=5


Student-t distribution 


To use this model to construct a confidence interval,
we need to again assume that the sampling
distribution is normal and that the sample was
collected randomly. Just as we saw above, there are
two general situations that ensure
the sampling distribution is normal:


* If the sample size is greater than 30, then the 
central limit theorem tells us that we can 
assume that the sampling distribution is 
approximately normal regardless of the 
population distribution. Thus, if the sample size 
is greater than 30, we can use this model. 

* If the sample size is less than 30, the central
limit theorem does not guarantee that the 
sampling distribution of the means will be 
normal. Therefore, to use this model the 
population distribution needs to be 
approximately normal so that we know that 
the sampling distribution for sample means is 
normal. 


Since we don't know the population standard
deviation, we will be using the sample standard
deviation to estimate $\sigma_x$. That means we are
estimating the population mean using the sample
mean and sample standard deviation. This suggests
that there may be more error in our estimate. To
account for the greater error, we want the
confidence interval to be slightly wider. To do this
the margin of error needs to be slightly bigger. The
margin of error is the critical value times the standard
error. The standard error is inherent to the
population and can't be changed, but the critical
value can be. So instead of using the standard
normal distribution to find the critical value, we use
the Student-t distribution. [footnote]


Here is some information about the Student-t 
distribution. 


* The Student-t distribution is a symmetric,
bell-shaped distribution, similar in shape to a normal
distribution, with $\mu = 0$ and $\sigma > 1$. The standard
deviation of the Student-t distribution is
different for different sample sizes. Remember
that the standard normal distribution is a
normal distribution with $\mu = 0$ and $\sigma = 1$.
Therefore, the Student-t distribution is centred
at the same place as the standard normal
distribution, but has greater variation so it is
slightly wider and shorter. See [link].

* The smaller the sample size, the greater the 
variability is in the sampling distribution. 
When the sample size is larger, there is less 
variability in the sampling distribution. These 
aspects are reflected in the shape of the Student-t
distribution. 

* As the sample size n gets larger, the Student-t 
distribution gets closer to the standard normal 
distribution. 


[Figure: the standard normal curve compared with Student-t curves for n = 5 and n = 20.]


The standard deviation of the Student-t distribution 
is based on the degrees of freedom, which in turn 
are based on the sample size. The number of degrees 
of freedom for a sample corresponds to the number 
of data values that can vary after certain restrictions 
have been imposed on all data values. Put another
way, the degrees of freedom are the
number of components that need to be known 
before a statistic is entirely determined. Depending 
on the model used, the degrees of freedom have a 
different formula. For this model (i.e. confidence 
interval for one population mean), the degrees of 
freedom are the sample size minus 1, i.e. n-1. 


As stated above, we want the width of the 
confidence interval to be wider to take into account 
the larger variation due to the estimate of the 
standard deviation. As you can see from the figure 
above, the Student-t distribution is wider than the 
standard normal distribution, which means that the
critical value for a 95% confidence level will be
greater than that for the standard normal. See the
image below.


[Figure: 95% critical values for the standard normal curve and the Student-t curve with n = 5.]


Notice that the critical value for the Student-t curve falls about halfway
between ±2 and ±3, whereas the critical value for the
standard normal distribution is ±1.96.
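
The gap between the two critical values can be checked numerically. What follows is a minimal sketch in Python using only the standard library. The helper names (`t_pdf`, `t_central_area`, `t_critical`) are illustrative and do not come from the text, and the Student-t critical value is found here by numerical integration and bisection, which is an assumption of this sketch rather than the method used by the course's computer program.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of the Student-t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_central_area(x, df, steps=2000):
    """Area under the t curve between -x and +x, using Simpson's rule."""
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3


def t_critical(confidence, df):
    """Find t so that the central area equals the confidence level (bisection)."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_central_area(mid, df) < confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


z_crit = NormalDist().inv_cdf(0.975)   # 95% critical value, standard normal
t_crit_n5 = t_critical(0.95, df=4)     # sample size n = 5
t_crit_n25 = t_critical(0.95, df=24)   # sample size n = 25

print(round(z_crit, 3), round(t_crit_n5, 3), round(t_crit_n25, 3))
```

Under these assumptions the script should print about 1.96 for the standard normal, about 2.78 for n = 5, and about 2.06 for n = 25, which illustrates that the Student-t critical value shrinks toward the standard normal value as the sample size grows.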


The margin of error for this model is:
$E = t_{\alpha/2} \times \frac{s}{\sqrt{n}}$


The confidence interval is constructed in the same
way: $\bar{x} \pm E$.


A manufacturer of AAA batteries wants to 
estimate the mean life expectancy of the 
batteries. It is known that the life expectancy of 
such batteries is typically normally distributed. 


A random sample of 25 batteries has a mean of
44.25 hours and a standard deviation of 2.25
hours. Assume the population is normal.


1. Construct a 95% confidence interval for 
the mean life expectancy of all the AAA 
batteries made by this manufacturer. 

2. Interpret the 95% confidence interval. 

3. If the confidence level is decreased to 90%, 
how does the confidence interval change? 


1. We can use the Student-t distribution 
model to construct the confidence interval, 
because the population standard deviation 
is unknown (so we don't use the standard 
normal distribution), the sample is 
collected randomly, and the sampling 
distribution of the sample means is normal 
because the population distribution is 
normal. To find the confidence interval, 
use a computer program. Make sure to 
choose the t-model (instead of the z- 
model). Input the sample size as 25, the 
sample mean as 44.25 and the standard 
deviation as 2.25. Choose the level of 
confidence to be 95%. This gives the
computer output. From this, we can see that the confidence
interval for the mean is 43.321 to 45.179.

2. To interpret the confidence interval, we 
would say that we are 95% confident that 
the true mean battery life of this brand of AAA
batteries is somewhere between 43.32 
hours and 45.18 hours. 

3. If the confidence level is decreased to 90%, 
we would expect that the confidence 
interval would get narrower. A higher level 
of confidence is obtained by making the 
confidence interval wider. Therefore, if the 
confidence level is decreased, then the 
confidence interval would get narrower. 


Notice from the computer output that the
critical value is 2.064 with 24 degrees of
freedom (i.e. one less than the sample size). If
the population standard deviation were known,
the critical value would be 1.96. To reiterate,
since we are estimating the population standard
deviation with the sample standard deviation,
we know there is more room for error in the
estimate. Therefore, we want the estimate (i.e. the
confidence interval) to be slightly wider, thus
the margin of error needs to be slightly bigger.
This is done by using the Student-t distribution,
which results in bigger critical values for the
same confidence level than would occur for the
standard normal distribution. In this case, the
critical value is 2.064.
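
The arithmetic behind the battery example can be reproduced by hand. The sketch below is only an illustration: the function name `t_interval` is made up for this example, the 95% critical value 2.064 is the one quoted in the text, and the 90% critical value 1.711 is an assumption taken from a standard t-table for 24 degrees of freedom (it does not appear in the text).

```python
import math


def t_interval(sample_mean, sample_sd, n, t_critical):
    """Confidence interval for a mean: sample mean +/- t * s / sqrt(n)."""
    margin = t_critical * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# 95% interval, using the critical value 2.064 (24 degrees of freedom) from the text
lower_95, upper_95 = t_interval(44.25, 2.25, 25, 2.064)

# 90% interval, using 1.711 (assumed t-table value for 24 degrees of freedom)
lower_90, upper_90 = t_interval(44.25, 2.25, 25, 1.711)

print(f"95%: {lower_95:.3f} to {upper_95:.3f}")
print(f"90%: {lower_90:.3f} to {upper_90:.3f}")
```

The 95% line should agree with the interval 43.321 to 45.179 reported above, and the 90% interval should come out narrower, in line with the answer to part 3.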


[link] is a flow chart that indicates how to make a 
choice of which model to use to construct a 
confidence interval (CI) for the mean. 

Flow chart for determining which model to use 
when constructing confidence interval for the mean 


1. Is the sampling distribution normal? To decide, ask: is the
   population distribution normal?
   * Yes: then the sampling distribution is normal, regardless of
     the sample size. Go to step 2.
   * No: is the sample size greater than 30?
     * Yes: then the sampling distribution is approximately
       normal, due to the central limit theorem. Go to step 2.
     * No: then the sampling distribution is NOT guaranteed to
       be normal. STOP! None of the models you've learned can
       help you. Wait until MGMT2263 to answer the question.

2. Is the population standard deviation known?
   * Yes: use the standard normal distribution, z, as the model.
   * No: use the Student-t distribution as the model.
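
The decisions in the flow chart can also be written as a small function. This is a sketch only: the function name `choose_mean_model` and its arguments are hypothetical, and returning `None` is just one way to represent the "STOP" branch.

```python
def choose_mean_model(population_normal, n, sigma_known):
    """Return "z", "t", or None, following the flow chart for a CI for the mean."""
    # Step 1: is the sampling distribution (approximately) normal?
    sampling_normal = population_normal or n > 30
    if not sampling_normal:
        return None  # none of the models covered so far applies
    # Step 2: is the population standard deviation known?
    return "z" if sigma_known else "t"


# Battery example: normal population, n = 25, population standard deviation unknown
print(choose_mean_model(population_normal=True, n=25, sigma_known=False))
```

For the battery example this should select the Student-t model, which is the model used in that solution.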


Sample Size Determination 


Determining an appropriate sample size is very 
important. Too small of a sample may lead to poor 
results. Too large of a sample needlessly wastes time 
and money. 


Prior to this section, we would have determined if a 
sample size was large enough simply by guessing. 
Here we will learn a formula for finding the 
appropriate sample size based on the amount of 
error we will accept in our results. This can be done 
by determining the minimum sample size needed to 
have a certain margin of error. To do this, we solve 
for the sample size n in the margin of error formula. 
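$E = z_{\alpha/2} \cdot \frac{s}{\sqrt{n}} \quad\Rightarrow\quad \sqrt{n} = \frac{z_{\alpha/2} \cdot s}{E} \quad\Rightarrow\quad n = \left(\frac{z_{\alpha/2} \cdot s}{E}\right)^2$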


As we would always rather have one more
object of study than one less, we will always
round up the result of this calculation. That is, if the
result of the formula is 50.2, then we will round up
to 51.


A couple of notes about the formula: 


1. Since n is unknown we can't use t. Think about 
why this is so. 

2. We still need to have a sense of the standard 
deviation to use this formula. As such, we will 
often do a preliminary study to estimate the
standard deviation. 


You plan to do a study of hypnotherapy to 
determine how effective it is in increasing the 
number of hours of sleep participants get each 
night. To do this you will measure the number 
of hours of sleep for each of the participants 
after they've done hypnotherapy. You want to 
ensure that your estimate for the mean number 
of hours of sleep is within 0.2 hours of the true 
mean with a 95% level of confidence. Prior to 
doing the full study, you do a pilot study with 
12 participants, which provides the following 
data:

8.2  9.1  7.7  8.6  6.9  11.2
10.1  9.9  8.9  9.2  7.5  10.5


How many participants should be in your 
study? 


We know the confidence level (95%). The 
margin of error is stated by saying that we want 
the estimate of the true mean to be within 0.2 
hours. Thus the 0.2 hours is telling us how 
much error we want in the estimate (i.e. 
E=0.2). We do need to have a sense of the 
standard deviation, which we get from the 
preliminary study. Using the 12 participants, 
we get a sample standard deviation of 1.29. 


We can now use a computer program to do the
calculation. From the question, we know the
margin of error (E) is 0.2, the standard
deviation is 1.29, and the confidence level is
95%. When we input this into the computer
program, we get the minimum sample size.
From this, we can see that to get our estimate
within 0.2 hours of the true mean we
would need a sample size of at least 160
participants.
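
The same result can be obtained directly from the sample size formula. The following is a minimal sketch using Python's standard library; the function name `sample_size_for_mean` is illustrative, and the critical value is computed from the standard normal distribution rather than read from the course's computer program.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(confidence, sd_estimate, margin):
    """Smallest n for which z * s / sqrt(n) is at most the margin of error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sd_estimate / margin) ** 2)  # always round up


# Hypnotherapy example: 95% confidence, s = 1.29 from the pilot study, E = 0.2
n_needed = sample_size_for_mean(0.95, 1.29, 0.2)
print(n_needed)
```

`math.ceil` carries out the "always round up" rule, so a result such as 159.8 becomes 160.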


Confidence interval for proportion - MRU - C 
Lemieux 

Explanation of how to find and interpret a 
confidence interval for proportion and sample size 
determination. 


Here we want to construct a confidence interval to
estimate the population proportion π based on
the point estimate, the sample proportion $\hat{p}$.


Confidence intervals for proportion are constructed
by taking the point estimate $\hat{p}$ and adding and
subtracting the margin of error E: $\hat{p} \pm E$.


There is more than one model for constructing a 
confidence interval for the population proportion. The
model we will discuss here has the following 
criteria: 


* The variable being studied satisfies the 
conditions of the binomial distribution. 

* The sampling distribution for sample
proportions is approximately normal. This
occurs if the number of successes (n × π) is at
least 5 and the number of failures (n × (1 − π)) is
at least 5. As π is unknown, this can be checked
by determining if the number of successes and
failures in the sample are both at least 5.


The margin of error is found in a similar way to the
margin of error for the mean. That is, it is the
critical value × the standard error. As we are
assuming that the sampling distribution is
approximately normal, we will use the standard
normal distribution to find the critical value. Since
the variable being studied satisfies the conditions of
the binomial distribution, we know from Chapter 6
that the standard error of the sampling distribution
is $\sqrt{\frac{\pi(1-\pi)}{n}}$. As we don't know π (it is what we
are trying to estimate), we will replace π in the
formula with the sample proportion $\hat{p}$. This results
in the estimate of the standard error being $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.


If these conditions are met, then the formula for the
margin of error is:
$E = z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$


Example: Cell phones 


Suppose that a market research firm is hired to 
estimate the percent of adults living in Vancouver
who have cell phones. Five hundred randomly 
selected adult residents in Vancouver are surveyed 
to determine whether they have cell phones. Of the 
500 people sampled, 421 responded yes - they own 
cell phones. 


1. Using a 92% confidence level, compute a 
confidence interval estimate for the true 
proportion of adult residents of this city who 
have cell phones. 

2. Would it be appropriate to say that 85% of
residents have a cell phone in Vancouver?

3. What does the confidence level tell us in the
context of the question?


Solutions: 


1. We can use the standard normal model for 
proportions to construct our confidence interval 
as the variable (cell phone ownership) follows 
a binomial distribution (1: The variable is 
random (random sample); 2: The outcomes are 
being counted (number of people who have cell 
phones); 3: There is a fixed number of trials 
(500); 4: There are two possible outcomes 
(have cell phone or don't have cell phone); 5: 
Though π is unknown, it is fair to assume that
the proportion of people who have a cell phone 
on a given day in Vancouver is very stable) and 
the sampling distribution for proportions is 
normal as the number of successes is 421 and 
the number of failures is 79 (i.e. they are both 
greater than 5). Use a computer program to 
construct the confidence interval. Input x as 
421 (this may be in the same place as the 
sample proportion, but when you input the 
whole number it will switch to x), the sample 
size as 500, and the confidence level as 92%. 
Notice that you don't have to state whether it is
z or t as there is only one model for this.
From this, we can see that the confidence
interval for the proportion is 0.813 to 0.871.

2. To interpret the confidence interval, we would 
say that we are 92% confident that the proportion
of residents of Vancouver that own a cell phone 
is somewhere between 81.3% and 87.1%. 

3. Since 85% is contained in the confidence 
interval, it is appropriate to say that the 
proportion of residents in Vancouver who have 
a cell phone is 85%. 

4. The confidence level means that if we took 
many random samples of Vancouver residents 
of size 500 and constructed many confidence 
intervals for each of these random samples, 
then 92% of these confidence intervals will 
contain the population proportion of cell phone 
users, while 8% will not. 
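
The interval in the cell phone example can be reproduced from the margin of error formula. This is a sketch under stated assumptions: the function name `proportion_interval` is hypothetical, the check for at least 5 successes and 5 failures mirrors the criteria listed above, and the critical value comes from Python's `statistics.NormalDist` rather than from the course's computer program.

```python
import math
from statistics import NormalDist


def proportion_interval(successes, n, confidence):
    """z-based confidence interval for a population proportion."""
    failures = n - successes
    if successes < 5 or failures < 5:
        raise ValueError("need at least 5 successes and 5 failures")
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin


# Cell phone example: 421 of 500 respondents own a cell phone, 92% confidence
lower, upper, margin = proportion_interval(421, 500, 0.92)
print(f"{lower:.3f} to {upper:.3f}, margin of error {margin:.3f}")
```

The printed interval should agree with 0.813 to 0.871, with a margin of error of about 0.029.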


A couple of notes about the confidence interval: 


* The margin of error is 0.029 or 2.9%. The
margin of error for a confidence interval for
proportions has to be less than 1 (or 100%). If the
sample size is large enough, the margin of error
should be quite small (less than 10%).

* Since proportions can only range from 0 to 1 or
0% to 100%, the confidence interval can never
exceed these values. For example, if the sample
proportion is 92% and the margin of error is
10%, then the confidence interval would be
82% to 102%, but since the upper bound is
impossible, we would report the answer as 82%
to 100%.
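
The second note amounts to cutting the interval off at 0 and 1. A minimal sketch, with the hypothetical helper name `clamp_interval`:

```python
def clamp_interval(lower, upper):
    """Keep a confidence interval for a proportion inside the range 0 to 1."""
    return max(0.0, lower), min(1.0, upper)


# Example from the note: sample proportion 92%, margin of error 10%
p_hat, margin = 0.92, 0.10
clamped_lower, clamped_upper = clamp_interval(p_hat - margin, p_hat + margin)
print(clamped_lower, clamped_upper)
```

The lower bound stays at about 0.82, while the impossible upper bound of 1.02 is replaced by 1.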


Determining sample size 


Just like with the mean, we want to determine an 
appropriate sample size to achieve a maximum 
amount of error in our estimate for the population 
proportion. 


To find the formula for n, we again solve for n in
the formula for the margin of error. This results in
the following formula:
$n = \frac{z_{\alpha/2}^2 \, \hat{p}(1-\hat{p})}{E^2}$


To use this formula we need to know the margin of 
error, the confidence level and the sample 
proportion. 


Note: If no estimate for π exists, then use $\hat{p} = 0.5$.


The Western Canada Communications Company 
is considering a bid to provide long-distance 
phone service. You are asked to conduct a poll 


to estimate the percentage of consumers who 
are satisfied with their current long-distance 
phone service. You want to be 90% confident 
that your sample percentage is within 2.5 
percentage points of the true population value, 
and a Roper poll suggests that this percentage 
should be about 85%. How large must your 
sample be? 


The confidence level is 90%, the sample 
proportion is 85%, and the amount of error we 
want in our estimate (i.e. the margin of error) is 
2.5%. 


We can now use a computer program to do the 
calculation. From the question, we know the 
margin of error (E) is 0.025 (remember to write 
it as a decimal), the sample proportion is 0.85, 
and the confidence level is 90%. When we input 
this into the computer program, we get output 
similar to this. 
n = 552 (rounded up)


From this, we can see that we need to have at 
least 552 consumers in our sample. 
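
The calculation can be sketched with the sample size formula for proportions. The function name `sample_size_for_proportion` is illustrative, and the default of 0.5 for the estimate follows the note above about what to do when no estimate of the proportion exists.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(confidence, margin, p_estimate=0.5):
    """Smallest n giving the requested margin of error for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p_estimate * (1 - p_estimate) / margin ** 2)


# Long-distance phone service example: 90% confidence, E = 0.025, estimate 0.85
n_with_estimate = sample_size_for_proportion(0.90, 0.025, 0.85)

# Same study if no prior estimate were available (falls back to 0.5)
n_without_estimate = sample_size_for_proportion(0.90, 0.025)

print(n_with_estimate, n_without_estimate)
```

The first value should match the 552 consumers found above. The second is larger, because 0.5 maximizes the product of the proportion and its complement and therefore gives the most cautious sample size.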


Practice questions (2019) 
Eight practice questions for the end of unit on one- 
sample hypothesis tests and confidence intervals. 


Practice questions for Chap. 7 & 8 


These questions were derived from Lyryx Learning, 
Business Statistics I -- MGMT 2262 -- Mt Royal 
University -- Version 2016 Revision A. OpenStax 
CNX. Sep 8, 2016 http://cnx.org/contents/ 
f3aefa9e-58d2-41ea-969f-04dc2cb04c82@5.5. 


If a question has a set of data, please see the course 
site for the Excel file. 


Solutions are at the end of the chapter. 


1. Question 1: The Specific Absorption Rate (SAR) 
for a cell phone measures the amount of radio 
frequency (RF) energy absorbed by the user's 
body when using the handset. Every cell phone 
emits RF energy. Different phone models have 
different SAR measures. To receive certification 


from the Federal Communications Commission 
(FCC) for sale in the United States, the SAR 
level for a cell phone must be no more than 1.6 
watts per kilogram. Table 7.1 shows the highest 
SAR level for a random selection of cell phone 
models as measured by the FCC. A recent study 
has shown that if a cell phone's SAR level 
exceeds 0.9 watts per kilogram, there is an 
increased chance of brain tumours for those
who use this phone. [footnote] An advocacy
group wants to use this new study to petition 
the FCC to change their regulations around the 

1. What is the variable being studied?
Categorize it. Based on this, what
descriptive statistic (mean or proportion)
is best for this situation?

2. Is it appropriate to assume that the
sampling distribution is normal? Explain
your reasoning and provide evidence for
your choice. Regardless of your answer in
part 2, assume that the sampling distribution
is normal for the remaining questions.

3. The advocacy group will go forward with
their petition if they can show that, on
average, cell phones have SAR rates that
exceed 0.9 watts per kg. This advocacy
group is run by an administrator who is
very risk averse (meaning they will only
go forward with the petition if there is a
lot of evidence). Determine whether the
advocacy group should go forward with
their petition by performing an
appropriate eight-step hypothesis test.

4. Find a confidence interval for the true
(population) mean of the Specific
Absorption Rates (SARs) for cell phones.
Choose a confidence level that
complements the level of significance you
have chosen above.

5. Interpret the confidence interval in the
context of the question.

6. Does the confidence interval suggest that
the mean SAR exceeds 0.9? Compare your
answer with what you got for the
hypothesis test. Do the confidence interval
and hypothesis test support each other?
Explain your answer.


This is completely made-up. 

2. Question 2: A hospital is trying to cut down on
emergency room wait times. In the past, they 
have found that the average wait time is 1.4 
hours for patients to be called back to be 
examined. They have implemented a new 
triage protocol and are interested in seeing if it 
has changed the amount of time patients must 
wait before being called back to be examined. 
An investigation committee randomly surveyed 
70 patients. The sample mean wait time was 
1.5 hours with a sample standard deviation of 
0.5 hours. 


1. What is the variable being studied? 
Categorize it. Based on this, what 
descriptive statistic (mean or proportion) 
is best for this situation? 


2. Use an appropriate eight-step hypothesis
test to determine if the average wait time for
patients to be called back to be examined
has changed from 1.4 hours. Use a level of
significance of 10%.

3. Is there a level of significance that causes
you to change your decision?

4. Suppose the true population mean wait
time is 1.4 hours. Have you made an error
in part 2? If so, what type?

5. Construct a 90% confidence interval for
the population mean emergency room wait
times.

6. Interpret the confidence interval in the
context of the question.

7. If the investigation committee wants to
increase its level of confidence and keep
the margin of error the same by taking
another survey, what changes should it
make?

8. If the investigation committee did another
survey, kept the margin of error the same,
and surveyed 200 people instead of 70,
how would the level of confidence have to
change? Why?

9. Suppose the investigation committee wanted
their estimate of the population mean
emergency room wait times to be within
0.05 hours of the true mean. How many
patients would they need to interview?


3. Question 3: Twenty-five Americans were 
surveyed to determine the number of hours 
they spend watching television each month. 
Assume that the underlying population
distribution is normal and the population 
standard deviation is known to be 32 hours. 


1. What is the variable being studied? 
Categorize it. Based on this, what 
descriptive statistic (mean or proportion) 
is best for this situation? 

2. The U.S. government has recently released 
a recommendation that Americans watch 
less than 150 hours of television per 
month. Based on this sample, is there 
enough evidence to suggest that, on 
average, Americans are meeting this 
recommendation? Base your answer on an 
appropriate eight-step hypothesis test. Use
α = 5%.

3. Construct a 99% confidence interval for
the population mean hours spent watching
television per month.

4. Interpret the confidence interval in the
context of the question.

5. Explain what the confidence level means
in the context of the question.


4. Question 4: The standard deviation of the 
weights of newborn elephants is known to be 
approximately 15 pounds. We wish to construct 
a 95% confidence interval for the mean weight 
of newborn elephant calves. Fifty newborn 
elephants are weighed. The sample mean is 244 
pounds. The sample standard deviation is 11 
pounds. 


1. What model will you use to construct a
confidence interval for the population
mean? Explain your reasoning by referring
to the criteria for that model.

2. Construct a 95% confidence interval for
the population mean weight of newborn
elephants.

3. What will happen to the confidence
interval obtained, if 500 newborn
elephants are weighed instead of 50?
Why?

4. Based on the confidence interval, is it fair
to say that the average weight of
newborn elephants exceeds 235 pounds?
Explain your answer.

5. Does an appropriate hypothesis test
support your decision in part 4? Explain your
answer by doing the eight-step hypothesis
test.


5. Question 5: A news magazine is investigating 
the changing dynamics in marriages. 
Historically, men made many of the financial 
decisions including the decision on whether to 
make major household purchases (such as 
buying a new vehicle or doing a renovation), 
while women were left out of them. To 
investigate whether this has changed, the 
magazine is considering doing a study to find 
out the percentage of couples who are equally 
involved in making decisions about household
purchases. 


1. What is the variable being studied? 
Categorize it. Based on this, what 
descriptive statistic (mean or proportion) 
is best for this situation? 

2. When designing a study to determine this 
population proportion, what is the 
minimum number you would need to 
survey to be 90% confident that the 
population proportion is estimated to 
within 0.05? 

3. If it were later determined that it was
important to be more than 90% confident,
how would it affect the minimum number
you need to survey? Why? Do not do any
calculations.

Suppose the marketing company did do the survey. They
randomly surveyed 200 households and
found that in 114 of them, the couple
makes major household purchasing
decisions together. A similar study from
the 1980s found that 46.5% of couples
made major household purchasing
decisions together.

4. Conduct an eight-step hypothesis test to 
determine whether there has been a 
significant increase in the number of 
couples who make major household 
purchasing decisions together since the 
1980s. The editor of the magazine will 
only publish the article if there is ample 
evidence to support the claim. 

5. Construct a 95% confidence interval for 
the population proportion of couples who 
make major household purchasing 
decisions together. 

6. Interpret the confidence interval in the 
context of the question. 

7. If the rate has increased, use the 
confidence interval to determine by how 
much the rate has increased since the 
1980s. 

8. List two difficulties the company might 
have in obtaining random results, if this 
survey were done by email. 


6. Question 6: Suppose that an accounting firm
has developed new software to help their
clients do their taxes more quickly. Based on
a national survey, most people spend 24.4
hours completing their personal income taxes a 
year. The accounting firm has a random sample 
of 100 of their clients complete their 2016 
income tax return using the new software. The 
sample mean time to complete the tax returns 
is 23.6 hours with a standard deviation of 7.0 
hours. The firm doesn't want to release the 
software unless they are sure it will reduce the 
time it takes clients to do their taxes. The 
population distribution is assumed to be 
normal. 


1. What is the variable being studied? 
Categorize it. Based on this, what 
descriptive statistic (mean or proportion) 
is best for this situation? 

2. Conduct an appropriate eight-step 
hypothesis test to determine if, on 
average, the software has reduced the time 
it takes clients to do their taxes. 

3. Suppose the truth is that the software does 
help clients do their taxes faster. Has an 
error been committed? If so, what type of 
error is it? Explain your answers. 

4. Construct a 90% confidence interval for 
the population mean time to complete the 
tax forms. 

5. Interpret the confidence interval in the 
context of the question. 


6. Does the confidence interval support the 
results of the hypothesis test? Explain your 
answer. 

7. If the firm wished to increase its level of 
confidence and keep the margin of error 
the same by taking another survey, what 
changes should it make? Why? 

8. If the firm did another survey, kept the 
margin of error the same, and only 
surveyed 49 people, how would the level 
of confidence have to change? Why? 

9. Suppose that the firm decided that it 
needed to be at least 96% confident of the 
population mean length of time to within 
one hour. How would the number of 
people the firm surveys change? Why? 


7. Question 7: In 2013, it was determined that 
21% of North Americans download music 
illegally. Public Policy Polling is wondering 
whether that number has changed. They asked 
a random sample of adults across North 
America about their downloading habits. When 
asked, 512 of the 2247 participants admitted 
that they have illegally downloaded music. 


1. Has the proportion of North Americans 
who illegally download music increased 
since 2013? Conduct an appropriate eight- 
step hypothesis test to support your 
answer. 


2. Create and interpret a 99% confidence 
interval for the true proportion of North 
American adults who have illegally 
downloaded music. 

3. This survey was conducted through 
automated telephone interviews on May 6 
and 7 of this year. The margin of error of 
the survey compensates for sampling 
error, or natural variability among 
samples. List some factors that could affect 
the survey's outcome that are not covered
by the margin of error. 

4. Without performing any calculations, 
describe how the confidence interval 
would change if the confidence level 
changed from 99% to 90%. 

5. Suppose Public Policy Polling wants to
conduct the study again now. They want
to keep the same level of confidence as
their last survey, but they want their
results to be within 2% of the true proportion
of Canadian adults who have illegally 
downloaded music. What is the minimum 
sample size they need to obtain this? 


8. Question 8: A survey of the mean number of 
cents off that coupons give was conducted by 
randomly surveying one coupon per page from 
the coupon sections of a recent San Jose 
Mercury News. The following data were 
collected (in cents): 20; 75; 50; 65; 30; 55;
40; 40; 30; 55; 150; 40; 65; 40. Assume the
underlying distribution is approximately
normal.


1. What is the variable being studied?
Categorize it. Based on this, what
descriptive statistic (mean or proportion)
is best for this situation?

2. Conduct an appropriate eight-step
hypothesis test to determine if the mean
number of cents off a coupon is different
from 50. Use a level of significance of 3%.

3. What is the probability of committing a
type I error in the above hypothesis test?

4. Construct a 97% confidence interval for
the population mean worth of coupons.

5. Interpret the confidence interval in the
context of the question.

6. If many random samples were taken of
size 14, what percent of the confidence
intervals constructed should contain the
population mean worth of coupons?
Explain why.


Solutions to Practice questions 


1.


1. The variable is the specific absorption rate.
It is quantitative continuous data. The best
descriptive statistic for this type of data is
the mean.

2. Since the sample size is less than 30, we
can only assume the sampling distribution
is normal if the population distribution is
close to being normal. Based on the
normal curve plot and the empirical rule,
it appears that the sample is not normally
distributed. The normal curve plot is not a
straight line and only 55.6% of the data
fall within one standard deviation of
the mean. This conclusion is supported by a
bimodal histogram. This suggests that the
population distribution is not normal,
which means we cannot be certain the
sampling distribution is normal.
Regardless of your answer in part 2, assume
that the sampling distribution is normal
for the remaining questions.


3. State H0 and HA. H0: on average, cell
phones have SAR rates that are 0.9 watts
per kg, μ = 0.9; HA: on average, cell
phones have SAR rates that exceed 0.9
watts per kg, μ > 0.9.

Summarize the sample data. n = 27, $\bar{x}$ = 0.989, s = 0.410.

State and justify the model (or
distribution) being used.

* Sampling distribution of sample
means is normal? Yes, as stated in the
question.

* Population standard deviation is
known? No.

Therefore, since we need to estimate the population
standard deviation using the sample
standard deviation, we will use the t-based
mean model.

Choose an appropriate level of
significance. Since the administrator is
risk averse, they want to ensure that they
have rejected H0 with a lot of evidence.
Therefore, the level of significance that
requires the most evidence to reject H0 is
1%. If p < 1%, reject H0. If p ≥ 1%, do not
reject H0.

Calculate the test statistic and
related p-value. Test stat: 1.128;
p = 0.1357.

Discuss what the p-value
measures in context. The probability that
a sample mean SAR of at least 0.989 is
observed, under the assumption that the
mean SAR is 0.9, is 13.57%.

Make a
decision. Since p (13.57%) is greater than
α (1%), we do not reject H0.

Offer a
concluding sentence. There is not
sufficient evidence to suggest that, on
average, cell phones have SAR rates that
exceed 0.9 watts per kg, which means the
advocacy group should not go forward
with their petition.


4. Since α = 1%, I will use a confidence level
of 98% (for a one-tailed HT, use 1 − 2α
to determine the complementary CL): 0.793 to
1.18.

5. We are 98% confident that the true
population mean for SARs is somewhere
between 0.793 watts/kg and 1.18 watts/kg.

6. Though there are possible values for the
population mean that do exceed 0.9 watts/kg
in the CI, there are also values that do
not exceed 0.9 watts/kg. Therefore, the CI
would lead to an inconclusive result,
meaning it is not clear from the CI
whether the population mean exceeds 0.9 or not.
This aligns with our hypothesis test that
there is not enough evidence to suggest
that the population mean exceeds 0.9
watts/kg.
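
The numbers in this solution can be checked from the summary statistics alone. The sketch below is only illustrative: the helper name `standard_error` is made up, the test statistic uses the usual form (sample mean minus hypothesized mean, divided by the standard error), the p-value 0.1357 is copied from the solution rather than computed, and the 98% critical value 2.479 is an assumption taken from a standard t-table for 26 degrees of freedom.

```python
import math


def standard_error(sample_sd, n):
    """Estimated standard error of the sample mean."""
    return sample_sd / math.sqrt(n)


n, x_bar, s, mu_0 = 27, 0.989, 0.410, 0.9  # summary values from the solution

# Test statistic for H0: mu = 0.9 against HA: mu > 0.9
t_stat = (x_bar - mu_0) / standard_error(s, n)

# 98% interval; 2.479 is an assumed t-table value for 26 degrees of freedom
t_crit_98 = 2.479
margin = t_crit_98 * standard_error(s, n)
ci_98 = (x_bar - margin, x_bar + margin)

# Decision rule from the solution: reject H0 only if p < alpha
p_value, alpha = 0.1357, 0.01
decision = "reject H0" if p_value < alpha else "do not reject H0"

print(round(t_stat, 3), [round(v, 3) for v in ci_98], decision)
```

The interval contains 0.9, which is consistent with the decision not to reject the null hypothesis.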


. The variable is the emergency room wait 
times. It is quantitative continuous data. 
The best descriptive statistic for this type 
of data is the mean. 


. State HO and HA. HO: the average wait 
time for patients to be called back to be 
examined is 1.4 hours, u=1.4; HA: the 
average wait time for patients to be called 
back to be examined has changed from 1.4 
hours, 1+ 1.4 Summarize the sample 
data. n=70,X =1.5,s=0.5 State and 
justify the model (or distribution) being 
used. Therefore, we will use the t-based 


mean model. 


¢ Sampling distribution of sample 
means is normal? Yes as the sample 
size (70) is greater than 30, the 
central limit theorem applies and the 
sampling distribution of sample 
means is normally distributed. 

* Population standard deviation is 
known? No 


Choose an appropriate level of 
significance. As stated in the question, 
use 10%. If p < 10%, reject H0. If p ≥ 10%, 
do not reject H0. Calculate the test 
statistic and related p-value. Test stat: 
1.673; p = 0.0988. Discuss what the 
p-value measures in context. The 
probability (times 2) that a sample mean 
wait time of at least 1.5 hours is observed, 
under the assumption that the mean wait 
time is 1.4, is 9.88%. Make a decision. 
Since p (9.88%) is less than α (10%), we 
reject H0. Offer a concluding sentence. 
There is sufficient evidence to suggest that 
the average wait time for patients to be 
called back to be examined has changed 
from 1.4 hours. 
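
The test statistic and p-value quoted above can be reproduced with a short Python sketch. It is only an illustration: the helper names `t_pdf` and `two_sided_t_p_value` are made up for this example, and the p-value is approximated by integrating the Student-t density numerically (Simpson's rule) rather than by whatever calculator routine produced the figures in the answer.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_t_p_value(t, df, steps=2000):
    """Two-sided p-value: 1 minus twice the area between 0 and |t|.

    The area is approximated with Simpson's rule (steps must be even).
    """
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    return 1.0 - 2.0 * central_area


# Sample summary from the emergency room wait-time problem.
n, x_bar, s, mu_0 = 70, 1.5, 0.5, 1.4

t_stat = (x_bar - mu_0) / (s / math.sqrt(n))       # t-based test statistic
p_value = two_sided_t_p_value(t_stat, df=n - 1)    # two-tailed test

print(round(t_stat, 3), round(p_value, 4))
```

With these inputs the sketch should land close to the test statistic of 1.673 and the p-value of 0.0988 reported in the answer.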


. Yes, if α = 0.0988, we would change our 
decision to "do not reject H0." 


. Yes. We have concluded that the mean has 
changed from 1.4, but the truth is that the 
mean has stayed the same. Therefore, we 
have made an error. As we have 
incorrectly rejected H0, it is a type I error. 

. 1.402 to 1.598 

. We are 90% confident that the population 
average wait time in the emergency room 
is somewhere between 1.4 hours and 1.6 
hours. 

. If the level of confidence is increased, then 
the critical value in the margin of error 
would increase. To keep the margin of 
error the same, either the standard 
deviation would need to decrease, or the 
sample size would need to increase. As the 
standard deviation is inherent to the data, 
the sample size needs to increase. 

. If the sample size increases, then the 
margin of error decreases. This means that 
to keep the margin of error constant, the 
level of confidence would need to 
increase. This would cause the critical 
value to be bigger which would 
compensate for the larger sample size. 
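
The relationship between confidence level, sample size, and margin of error can be explored numerically. The sketch below is a rough illustration that assumes z critical values (a simplification, since the wait-time problem itself uses the t-based model) and uses s = 0.5 and n = 70 from that problem; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def margin_of_error(confidence, sd, n):
    """Margin of error = critical value * sd / sqrt(n), using a z critical value."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z_star * sd / math.sqrt(n)


def required_n(confidence, sd, target_margin):
    """Smallest sample size whose margin of error does not exceed target_margin."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z_star * sd / target_margin) ** 2)


sd, n = 0.5, 70
base = margin_of_error(0.90, sd, n)    # margin at 90% confidence
wider = margin_of_error(0.99, sd, n)   # same data, higher confidence
n_needed = required_n(0.99, sd, base)  # n that restores the 90% margin at 99%

print(round(base, 4), round(wider, 4), n_needed)
```

Raising the confidence level with n fixed widens the margin, and holding the margin fixed at the higher confidence level calls for a larger sample.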

. They would need to interview at least 271 
patients. 


. The variable is the number of hours 
Americans spend watching TV. It is 
quantitative discrete data. The best 
descriptive statistic for this type of data is 
the mean. 


. State H0 and HA. H0: on average, 
Americans are not meeting this 
recommendation, μ = 150; HA: on average, 
Americans are meeting this 
recommendation, μ < 150. Summarize the 
sample data. n = 25, x̄ = 149.64, σ = 32. 
State and justify the model (or 
distribution) being used. As the 
population standard deviation is known, 
we will use the z-based mean model. 


* Sampling distribution of sample 
means is normal? Yes. The preamble 
states that the population is normally 
distributed. As the population 
distribution is assumed to be normal, 
we know the sampling distribution of 
sample means is also normal, even 
though the sample size is less than 30. 

* Population standard deviation is 
known? Yes 


Choose an appropriate level of 
significance. The level of significance is 
provided in the question. If p < 5%, reject 
H0. If p ≥ 5%, do not reject H0. Calculate 
the test statistic and related p-value. 
Test stat: -0.056; p = 0.4776. Discuss what 
the p-value measures in context. The 
probability that a sample mean number of 
hours of TV watched of at most 149.64 
hours is observed, under the assumption 
that the mean number of hours watching 
TV is 150, is 47.76%. Make a decision. 
Since p (47.76%) is greater than α (5%), we 
do not reject H0. Offer a concluding 
sentence. There is not sufficient evidence 
to suggest that, on average, Americans are 
meeting the recommendation of watching 
less than 150 hours of television per 
month. 
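
As a check on the arithmetic, the z-based test for the TV-watching data can be sketched in Python with the standard library's `NormalDist`. The variable names are illustrative only.

```python
import math
from statistics import NormalDist

# Sample summary and hypothesized mean from the TV-watching problem.
n, x_bar, sigma, mu_0, alpha = 25, 149.64, 32, 150, 0.05

# z-based test statistic: the population standard deviation is known.
z = (x_bar - mu_0) / (sigma / math.sqrt(n))

# Left-tailed test (HA: mu < 150), so the p-value is the area below z.
p_value = NormalDist().cdf(z)

reject_h0 = p_value < alpha
print(round(z, 3), round(p_value, 4), reject_h0)
```

The result should be close to the test statistic of -0.056 and the p-value of 0.4776 given above, so H0 is not rejected at the 5% level.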


. 133.2 to 166.1 

. We are 99% confident that the population 
mean time that Americans spend watching 
TV is somewhere between 133.2 hours and 
166.1 hours. 
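
The interval above follows from x̄ ± z* · σ/√n. A minimal sketch, assuming the same sample summary as the hypothesis test (n = 25, x̄ = 149.64, σ = 32) and a hypothetical helper named `z_interval`:

```python
import math
from statistics import NormalDist


def z_interval(x_bar, sigma, n, confidence):
    """Confidence interval for a mean when the population sd is known."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin


low, high = z_interval(x_bar=149.64, sigma=32, n=25, confidence=0.99)
print(f"{low:.1f} to {high:.1f}")
```

Rounded to one decimal place, the endpoints should agree with the 133.2 to 166.1 interval reported in the answer.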

. The confidence level means that if we took 
many random samples of size 25 from the 
population of Americans and constructed 
a confidence interval from each of 
these random samples, then about 99% of 
these confidence intervals would contain the 
population mean time Americans spend 
watching TV, while about 1% would not. 


. We know the sampling distribution for 
sample means is approximately normal 
because the sample size is greater than 30, 
so the Central Limit Theorem applies. 
Therefore, we use either the Student-t or 
the standard normal distribution. As the 
population standard deviation is known, we 
can use the standard normal distribution 
(i.e., the z-based model). 

. 239.84 to 248.16 

. The confidence interval will get narrower 
because the margin of error will be 
smaller. The margin of error is smaller 
because the amount of error between the 
sample means and the population mean is 
smaller as stated in the law of large 
numbers. 

. Yes, the estimated population mean 
weight of newborn elephants is 239.84 
pounds to 248.16 pounds. Based on this, it 
is fair to say that the average weight 
exceeds 235 pounds, as both bounds are 
larger than 235. 


. State H0 and HA. H0: on average, 
newborn elephants weigh 235 pounds, 
μ = 235; HA: on average, newborn 
elephants weigh more than 235 pounds, 
μ > 235. Summarize the sample data. 
n = 50, x̄ = 244, σ = 15. State and justify 
the model (or distribution) being used. 
As the population standard deviation is 
known, we will use the z-based mean 
model. 


* Sampling distribution of sample 
means is normal? Yes. As the sample 
size (50) is greater than 30, the 
central limit theorem applies and the 
sampling distribution of sample 
means is approximately normal. 

* Population standard deviation is 
known? Yes 


Choose an appropriate level of 
significance. As the confidence level in 
the previous question was 95% and we are 
attempting to verify the CI with a HT, we 
should use an α of 2.5% (solve for α in 
0.95 = 1 - 2α, for a one-tailed HT). If 
p < 2.5%, reject H0. If p ≥ 2.5%, do not 
reject H0. Calculate the test statistic and 
related p-value. Test stat: 4.24; 
p = 1.10E-5 = 1.10 × 10^-5 = 0.000011. 
Discuss what the p-value measures in 
context. The probability that a sample 
mean weight of newborn elephants of at 
least 244 pounds is observed, under the 
assumption that the mean weight of 
newborn elephants is 235, is 0.0011%. 
Make a decision. Since p (0.0011%) is less 
than α (2.5%), we reject H0. Offer a 
concluding sentence. There is sufficient 
evidence to suggest that, on average, 
newborn elephants weigh more than 235 
pounds. 


. The variable is whether a couple 
makes major household purchasing 
decisions together or not. It is categorical 
nominal data. The best descriptive statistic 
for this type of data is a proportion. 

. They would need to interview a minimum 
of 271 households. (Note: As no estimate of 
the population proportion is provided, use 
50%.) 
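
The minimum sample size for estimating a proportion comes from n = z*² · p(1 - p) / E², rounded up. The sketch below is an illustration under stated assumptions: the 90% confidence level is taken from the follow-up question, while the margin of error of 5 percentage points is an assumption made here because the original question is not reproduced; it happens to be consistent with the answer of 271. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(confidence, margin, p_guess=0.5):
    """Minimum n for estimating a proportion to within +/- margin.

    p_guess = 0.5 is the conservative choice when no estimate is available.
    """
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z_star ** 2 * p_guess * (1 - p_guess) / margin ** 2)


# Assumed inputs: 90% confidence, margin of error 0.05 (the latter is a guess).
n_min = sample_size_for_proportion(confidence=0.90, margin=0.05)

# A higher confidence level with the same margin needs a larger sample.
n_min_95 = sample_size_for_proportion(confidence=0.95, margin=0.05)

print(n_min, n_min_95)
```

Under these assumptions, raising the confidence level above 90% increases the minimum number of households to survey, because the critical value z* grows.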

. If it were later determined that it was 
important to be more than 90% confident 
and a new survey were commissioned, 
how would it affect the minimum number 
you need to survey? Why? 


. State H0 and HA. We define π in this 
problem to be the population proportion 
of couples who make major household 
purchasing decisions together. H0: the 
proportion of couples who make major 
household purchasing decisions together is 
unchanged at 46.5%, π = 0.465; HA: the 
proportion of couples who make major 
household purchasing decisions together is 
greater than 46.5%, π > 0.465. Summarize 
the sample data. n = 200, X = 114. State 
and justify the model (or distribution) 
being used. 


* Binomial distribution? Yes, because 
there are only two outcomes: either a 
couple makes household decisions 
together or they don't. 

* Sampling distributions of proportions 
normal? Yes, because number of 
successes (114) and number of 
failures (200-114 =86) are both at 
least 5. 


Choose an appropriate level of 
significance. As the editor needs strong 
evidence, we need to choose α to be small, 
i.e., 1%. If p < 1%, reject H0. If p ≥ 1%, do 
not reject H0. Calculate the test statistic 
and related p-value. Test stat: 2.977; 
p = 0.00145. Discuss what the p-value 
measures in context. The probability that 
at least 114 out of 200 couples make 
major purchasing decisions together, 
assuming the rate has not changed since the 
1980s, is 0.15%. Make a decision. Since 
p (0.15%) is less than α (1%), we reject H0. 
Offer a concluding sentence. There is 
sufficient evidence to suggest that the 
proportion of couples who make major 
household purchasing decisions together is 
greater than 46.5%. 
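
The one-proportion test above can be sketched in Python. The function name `one_proportion_z_test` is hypothetical; the standard error uses the hypothesized proportion, as is usual for this test.

```python
import math
from statistics import NormalDist


def one_proportion_z_test(successes, n, pi_0):
    """Right-tailed z test for a single proportion (HA: proportion > pi_0)."""
    p_hat = successes / n
    std_error = math.sqrt(pi_0 * (1 - pi_0) / n)  # uses the null value pi_0
    z = (p_hat - pi_0) / std_error
    p_value = 1 - NormalDist().cdf(z)             # area above z
    return z, p_value


z, p_value = one_proportion_z_test(successes=114, n=200, pi_0=0.465)
print(round(z, 3), round(p_value, 5))
```

The output should be close to the test statistic of 2.977 and a p-value of roughly 0.0015, well below the 1% level of significance.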


. 0.5014 to 0.6386 

. We are 95% confident that the true 
proportion of couples who make major 
household purchasing decisions together is 
somewhere between 50.14% and 63.86%. 

. Based on the CI, the rate has increased 
by at least 3.6 percentage points and by at 
most 17.4 percentage points. 
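
The interval for the proportion, and the implied change from the earlier rate of 46.5%, can be checked with a brief sketch. The helper name `proportion_interval` is an assumption of this example; the standard error here uses the sample proportion, as is usual for a confidence interval.

```python
import math
from statistics import NormalDist


def proportion_interval(successes, n, confidence):
    """z-based confidence interval for a population proportion."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


low, high = proportion_interval(successes=114, n=200, confidence=0.95)

# Change relative to the earlier rate of 46.5%, as a difference of proportions.
old_rate = 0.465
increase_low, increase_high = low - old_rate, high - old_rate

print(f"{low:.4f} to {high:.4f}")
print(f"{increase_low:.3f} to {increase_high:.3f}")
```

The endpoints should match 0.5014 and 0.6386, and the differences from 0.465 come out near 0.036 and 0.174.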

. One issue is how the marketing 
company will develop the list of email 
addresses. Most likely they will not have a 
complete list of all emails for all 
households. Second, the email will 
be sent to a member of the household and 
not to the household as a whole. Thus one 
household may get multiple surveys. 
Further, not everyone uses email, so the 
sample will miss those households. 


. The variable is the amount of time people 
take completing their tax forms. It is 
quantitative continuous data. The best 
descriptive statistic for this type of data is 
the mean. 

. Conduct an appropriate eight-step 
hypothesis test to determine if, on 
average, the software has reduced the time 
it takes clients to do their taxes. 


State H0 and HA. H0: on average, the 
software has not reduced the time it takes 
clients to do their taxes, μ = 24.4; HA: on 
average, the software has reduced the time 
it takes clients to do their taxes, μ < 24.4. 
Summarize the sample data. 
n = 100, x̄ = 23.6, s = 7.0. State and justify 
the model (or distribution) being used. 
Since we need to estimate the 
population standard deviation using the 
sample standard deviation, but the sample 
size is large enough that the 
difference between the z-based and t-based 
models is minimal, we will use the z-based 
mean model. 


* Sampling distribution of sample 
means is normal? Yes. As the 
population distribution is assumed to 
be normal, we know the sampling 
distribution of sample means is also 
normal. 

* Population standard deviation is 
known? No 


Choose an appropriate level of 
significance. Since the firm doesn't want 
to release the software unless they are 
very confident that it works, they should 
choose a small level of significance (i.e., 
1%). If p < 1%, reject H0. If p ≥ 1%, do not 
reject H0. Calculate the test statistic and 
related p-value. Test stat: -1.14; 
p = 0.1279. Discuss what the p-value 
measures in context. The probability that 
a sample mean time to complete tax 
returns of at most 23.6 hours is observed, 
under the assumption that the mean time 
is 24.4, is 12.79%. Make a decision. Since 
p (12.79%) is greater than α (1%), we do 
not reject H0. Offer a concluding 
sentence. There is not sufficient evidence 
to suggest that, on average, the software 
has reduced the time it takes clients to do 
their taxes. 


. Since we have stated that there is 
not enough evidence that the software has 
reduced the time it takes clients to do their 
taxes, when in fact it has, we have 
committed a type II error. 

. 22.45 to 24.75 

. We are 90% confident that the true 
average time it takes for people to 
complete their tax forms with this new 
software is somewhere between 22.45 
hours and 24.75 hours. 

. The HT has led us to state that there is 
not sufficient evidence that the average 
time has been reduced from 24.4. The CI 
supports this, as it contains the 
hypothesized population mean of 24.4 hours. 

. If the level of confidence is increased, then 
the critical value in the margin of error 
would increase. To keep the margin of 
error the same, either the standard 
deviation would need to decrease, or the 
sample size would need to increase. As the 
standard deviation is inherent to the data, 
the sample size needs to increase. 

. If the sample size decreases, then the 
margin of error increases. This means that 
to keep the margin of error constant, the 
level of confidence would need to 
decrease. This would cause the critical 
value to be smaller which would 
compensate for the smaller sample size. 

. It would not change the number of people 
needed to be interviewed. The level of 
confidence and the sample size are 
independent of each other. 


. State H0 and HA. H0: the proportion of 
North Americans who illegally download 
music has not increased since 2013, 
π = 0.21; HA: the proportion of North 
Americans who illegally download music 
has increased since 2013, π > 0.21. Summarize 
the sample data. n = 2247, X = 512. State 
and justify the model (or distribution) 
being used. Since this is a one population 
proportions test, we want to use the 
standard normal distribution to model it. 
We need to check two things: 


* Binomial distribution? Yes, because 
there are only two outcomes: either a 
person illegally downloads music or 
they don't. 

* Is the sampling distribution of sample 
proportions normal? Yes, as the 
number of successes (512) and the 
number of failures (2247 - 512 
= 1735) are both at least 5. 


Therefore, we will use the standard normal 
distribution to model this situation. 


Choose an appropriate level of 
significance. As there is no motivation 
stated in the study, I will choose a level of 
significance that is a balance between 
rejecting and not rejecting H0, i.e., 5%. If 
p < 5%, reject H0. If p ≥ 5%, do not reject 
H0. Calculate the test statistic and 
related p-value. Test stat: 2.08; 
p = 0.0188. Discuss what the p-value 
measures in context. The probability that 
at least 512 out of 2247 North Americans 
admit that they have downloaded music 
illegally, assuming the rate has not 
changed since 2013, is 1.88%. Make a 
decision. Since p (1.88%) is less than 
α (5%), we reject H0. Offer a concluding 
sentence. There is sufficient evidence to 
suggest that the proportion of North 
Americans who illegally download music 
has increased since 2013. 


. We are 99% confident that the true 
proportion of North Americans that download 
music illegally is somewhere between 
20.51% and 25.07%. 

. Some people may not want to admit to 
having downloaded music illegally. It is 
unclear how PPP got the list of phone 
numbers. This list could miss cell phone 
users and thus would not be 
representative. 

. The confidence interval would get 
narrower. 

. 2919 


. The variable is the number of cents off 
that coupons give. It is quantitative 
discrete data. The best descriptive statistic 
for this type of data is the mean. 


. State H0 and HA. H0: the mean number 
of cents off a coupon is 50 cents, 
μ = 50; HA: the mean number of cents off a 
coupon is different from 50 cents, μ ≠ 50. 
Summarize the sample data. 
n = 14, x̄ = 53.93, s = 31.63. State and 
justify the model (or distribution) being 
used. Since we need to estimate 
the population standard deviation using 
the sample standard deviation, we will use 
the t-based mean model. 


* Sampling distribution of sample 
means is normal? Yes. As the 
population distribution is assumed to 
be normal, we know the sampling 
distribution of sample means is also 
normal. 

* Population standard deviation is 
known? No 


Choose an appropriate level of 
significance. The level of significance in 
the question is stated to be 3%. If p < 3%, 
reject H0. If p ≥ 3%, do not reject H0. 
Calculate the test statistic and related 
p-value. Test stat: 0.465; p = 0.6499. 
Discuss what the p-value measures in 
context. The probability (times 2) that a 
sample mean number of cents off a coupon 
of at least 53.929 is observed, under the 
assumption that the mean number of cents 
is 50, is 64.99%. Make a decision. Since 
p (64.99%) is greater than α (3%), we do 
not reject H0. Offer a concluding 
sentence. There is not sufficient evidence 
to suggest that the mean number of cents 
off a coupon is different from 50 cents. 


. It is the level of significance, 3%. 

. 33.335 to 74.522 

. We are 97% confident that the mean 
number of cents off that coupons give is 
somewhere between 33.3 and 74.5. 

. 97% of them would contain the population 
mean, while 3% would not. This is 
determined by the confidence level. 


Introduction -- Linear Regression and Correlation -- 
MtRoyal - Version2016RevA 

Linear regression and 
correlation can help you determine if an auto 
mechanic’s salary is related to his work experience. 
(credit: Joshua Rothhaas) 


Chapter Objectives 
By the end of this chapter, the student should be 
able to: 


* Discuss basic ideas of linear regression and 
correlation. 

* Create and interpret a line of best fit. 

* Calculate and interpret the correlation 
coefficient. 

* Calculate and interpret outliers. 


Professionals often want to know how two or more 
numeric variables are related. For example, is there 
a relationship between the grade on the second 
math exam a student takes and the grade on the 
final exam? If there is a relationship, what is the 
relationship and how strong is it? 


In another example, your income may be 
determined by your education, your profession, your 
years of experience, and your ability. The amount 
you pay a repair person for labor is often 
determined by an initial amount plus an hourly fee. 


The type of data described in the examples is 
bivariate data — "bi" for two variables. In reality, 
statisticians use multivariate data, meaning many 
variables. 


In this chapter, you will be studying the simplest 
form of regression, "linear regression" with one 
independent variable (x). This involves data that fits 
a line in two dimensions. You will also study 
correlation which measures how strong the 
relationship is. 


Scatter Plots -- Linear Regression and Correlation -- 
MtRoyal - Version2016RevA 


Before we take up the discussion of linear regression 
and correlation, we need to examine a way to 
display the relation between two variables x and y. 
The most common and easiest way is a scatter plot. 
The following example illustrates a scatter plot. 


In Europe and Asia, m-commerce is popular. M- 
commerce users have special mobile phones that 
work like electronic wallets as well as provide 
phone and Internet services. Users can do 
everything from paying for parking to buying a TV 
set or soda from a machine to banking to checking 
sports scores on the Internet. For the years 2000 
through 2004, was there a relationship between 
the year and the number of m-commerce users? 
Construct a scatter plot. Let x = the year and let y 
= the number of m-commerce users, in millions. 
[Table: number of m-commerce users (in millions) 
by year; data not reproduced.] 
[Figure: scatter plot of the number of m-commerce 
users (in millions) against x = year.] 


To create a scatter plot: 


1. Enter your X data into list L1 and your Y data 
into list L2. 

2. Press 2nd STATPLOT ENTER to use Plot 1. On 
the input screen for PLOT 1, highlight On and 
press ENTER. (Make sure the other plots are 
OFF.) 

3. For TYPE: highlight the very first icon, which 
is the scatter plot, and press ENTER. 

4. For Xlist:, enter L1 ENTER and for Ylist: L2 
ENTER. 

5. For Mark: it does not matter which symbol 
you highlight, but the square is the easiest to 
see. Press ENTER. 

6. Make sure there are no other equations that 
could be plotted. Press Y = and clear any 
equations out. 

7. Press the ZOOM key and then the number 9 
(for menu item "ZoomStat"); the calculator 
will fit the window to the data. You can press 
WINDOW to see the scaling of the axes. 


Try It 


Amelia plays basketball for her high school. 
She wants to improve to play at the college 
level. She notices that the number of points 
she scores in a game goes up in response to the 
number of hours she practices her jump shot 
each week. She records the following data: 


Table columns: X (hours practicing jump shot), 
Y (points scored in a game). 
[Table data not reproduced.] 

Construct a scatter plot and state if what 
Amelia thinks appears to be true. 


Yes, Amelia’s assumption appears to be 
correct. The number of points Amelia scores 
per game goes up when she practices her jump 
shot more. 


A scatter plot shows the direction of a relationship 
between the variables. A clear direction happens 
when there is either: 


* High values of one variable occurring with high 
values of the other variable or low values of 
one variable occurring with low values of the 
other variable. 

* High values of one variable occurring with low 
values of the other variable. 


You can determine the strength of the relationship 
by looking at the scatter plot and seeing how close 
the points are to a line, a power function, an 
exponential function, or to some other type of 
function. For a linear relationship there is an 
exception. Consider a scatter plot where all the 
points fall on a horizontal line providing a "perfect 
fit." The horizontal line would in fact show no 
relationship. 
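
One way to put numbers on direction and strength is the correlation coefficient, which this chapter's objectives mention. The sketch below uses small made-up data sets (they are not from the text) and a hypothetical helper `pearson_r`. It also shows the horizontal-line exception: when every y value is the same, there is no variation in y to relate to x, so the coefficient is undefined and the helper returns `None`.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient; None if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # no variation in one variable: no relationship to measure
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, chosen only to illustrate the three situations.
x_values = [1, 2, 3, 4, 5]
rising = [3, 5, 6, 9, 10]    # high x with high y: positive direction
falling = [10, 8, 7, 4, 3]   # high x with low y: negative direction
flat = [6, 6, 6, 6, 6]       # all points on a horizontal line

print(pearson_r(x_values, rising))
print(pearson_r(x_values, falling))
print(pearson_r(x_values, flat))
```

Values near +1 or -1 indicate points close to a rising or falling line; the flat case is reported as `None` rather than as a strong relationship.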


When you look at a scatterplot, you want to notice 
the overall pattern and any deviations from the 
pattern. The following scatterplot examples 
illustrate these concepts. 


(a) Negative linear pattern (strong) (b) Negative linear pattern (weak) 


(a) Exponential growth pattern (b) No pattern 


In this chapter, we are interested in scatter plots 
that show a linear pattern. Linear patterns are quite 
common. The linear relationship is strong if the 
points are close to a straight line, except in the case 
of a horizontal line where there is no relationship. If 
we think that the points show a linear relationship, 
we would like to draw a line on the scatter plot. 
This line can be calculated through a process called 
linear regression. However, we only calculate a 
regression line if one of the variables helps to 
explain or predict the other variable. If x is the 
independent variable and y the dependent variable, 
then we can use a regression line to predict y for a 
given value of x. 


Chapter Review 


Scatter plots are particularly helpful graphs when 
we want to see if there is a linear relationship 
among data points. They indicate both the direction 
of the relationship between the x variables and the y 
variables, and the strength of the relationship. We 
calculate the strength of the relationship between an 
independent variable and a dependent variable 
using linear regression. 


Does the scatter plot appear linear? Strong or 
weak? Positive or negative? 


[Scatter plot not reproduced.] 

The data appear to be linear with a strong, 
positive correlation. 


Does the scatter plot appear linear? Strong or 
weak? Positive or negative? 

[Scatter plot not reproduced.] 


Does the scatter plot appear linear? Strong or 
weak? Positive or negative? 

[Scatter plot not reproduced.] 

The data appear to have no correlation. 


Homework 


The Gross Domestic Product Purchasing Power 
Parity is an indication of a country’s currency 
value compared to another country. [link] 
shows the GDP PPP of Cuba as compared to US 
dollars. Construct a scatter plot of the data. 

[Table not reproduced.] 

Check student’s solution. 


The following table shows the poverty rates and 
cell phone usage in the United States. Construct 
a scatter plot of the data. 


Table columns: Year, Poverty Rate, Cellular Usage. 
[Table data not reproduced.] 


Does the higher cost of tuition translate into 
higher-paying jobs? The table lists the top ten 
colleges based on mid-career salary and the 
associated yearly tuition costs. Construct a 
scatter plot of the data. 


Table columns: School, Mid-Career Salary (in 
thousands), Yearly Tuition. 
[Table data not reproduced.] 


For graph: check student’s solution. Note that 
tuition is the independent variable and salary is 
the dependent variable. 


If the level of significance is 0.05 and the 
p-value is 0.06, what conclusion can you draw? 


If there are 15 data points in a set of data, what 
is the number of degrees of freedom? 


Linear Equations -- Linear Regression and 
Correlation -- MtRoyal - Version2016RevA 


Linear regression for two variables is based on a 
linear equation with one independent variable. The 
equation has the form: 

y = a + bx 


where a and b are constant numbers. 


The variable x is the independent variable, and y 
is the dependent variable. Typically, you choose a 
value to substitute for the independent variable and 
then solve for the dependent variable. 


The following examples are linear equations. 
y = 3 + 2x 
y = -0.01 + 1.2x 


Try It 


Is the following an example of a linear 
equation? 


y = -0.125 - 3.5x 


The graph of a linear equation of the form y = a + 
bx is a straight line. Any line that is not vertical 
can be described by this equation. 


Graph the equation y = -1 + 2x. 


Try It 


Is the following an example of a linear 
equation? Why or why not? 


No, the graph is not a straight line; therefore, 
it is not a linear equation. 


Aaron's Word Processing Service (AWPS) does 
word processing. The rate for services is $32 per 
hour plus a $31.50 one-time charge. The total cost 
to a customer depends on the number of hours it 
takes to complete the job. 


Find the equation that expresses the total cost 
in terms of the number of hours required to 
complete the job. 


Let x = the number of hours it takes to get the 
job done. 


Let y = the total cost to the customer. 


The $31.50 is a fixed cost. If it takes x hours to 
complete the job, then (32)(x) is the cost of 
the word processing only. The total cost is: 
y = 31.50 + 32x 
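
The same equation can be written as a small Python function, which makes it easy to evaluate the total cost for different job lengths. The function and parameter names are illustrative only.

```python
def total_cost(hours, fixed_charge=31.50, hourly_rate=32.0):
    """Total cost y = a + b*x: one-time charge plus hourly rate times hours."""
    return fixed_charge + hourly_rate * hours


# Cost of a job that takes no time at all, 1 hour, and 2.5 hours.
for hours in (0, 1, 2.5):
    print(hours, total_cost(hours))
```

At x = 0 the function returns the fixed charge alone, and each additional hour adds $32.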


Try It 


Emma’s Extreme Sports hires hang-gliding 
instructors and pays them a fee of $50 per 
class as well as $20 per student in the class. 
The total cost Emma pays depends on the 
number of students in a class. Find the 
equation that expresses the total cost in terms 
of the number of students in a class. 


Three possible graphs of y = a + bx. (a) If b > 0, 
the line slopes upward to the right. (b) If b = 0, the 
line is horizontal. (c) If b < 0, the line slopes 
downward to the right. 


Slope and Y-Intercept of a Linear 
Equation 


For the linear equation y = a + bx, b = slope and a 
= y-intercept. From algebra recall that the slope is a 
number that describes the steepness of a line, and 
the y-intercept is the y coordinate of the point (0, a) 
where the line crosses the y-axis. 




Svetlana tutors to make extra money for college. 
For each tutoring session, she charges a one-time 
fee of $25 plus $15 per hour of tutoring. A linear 
equation that expresses the total amount of money 
Svetlana earns for each session she tutors is y = 25 
+ 15x. 


What are the independent and dependent 
variables? What is the y-intercept and what is 
the slope? Interpret them using complete 
sentences. 


The independent variable (x) is the number of 
hours Svetlana tutors each session. The 
dependent variable (y) is the amount, in 
dollars, Svetlana earns for each session. 


The y-intercept is 25 (a = 25). At the start of 
the tutoring session, Svetlana charges a one- 
time fee of $25 (this is when x = 0). The slope 
is 15 (b = 15). For each session, Svetlana 
earns $15 for each hour she tutors. 
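
The interpretation of the y-intercept and slope can be read directly off the function values. A minimal sketch, with the function name `earnings` chosen only for this illustration:

```python
def earnings(hours):
    """Svetlana's earnings per session: $25 one-time fee plus $15 per hour."""
    return 25 + 15 * hours


# The y-intercept is the value of y when x = 0.
y_intercept = earnings(0)

# The slope is the change in y for a one-unit (one-hour) increase in x.
slope = earnings(3) - earnings(2)

print(y_intercept, slope)
```

Because the relationship is linear, the one-hour difference is the same no matter which pair of consecutive hours is compared.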


Try It 


Ethan repairs household appliances like 
dishwashers and refrigerators. For each visit, 
he charges $25 plus $20 per hour of work. A 
linear equation that expresses the total amount 
of money Ethan earns per visit is y = 25 + 
20x. 


What are the independent and dependent 
variables? What is the y-intercept and what is 
the slope? Interpret them using complete 
sentences. 


The independent variable (x) is the number of 
hours Ethan works each visit. The dependent 
variable (y) is the amount, in dollars, Ethan 
earns for each visit. 


The y-intercept is 25 (a = 25). At the start of a 
visit, Ethan charges a one-time fee of $25 (this 
is when x = 0). The slope is 20 (b = 20). For 
each visit, Ethan earns $20 for each hour he 
works. 


References 


Data from the Centers for Disease Control and 
Prevention. 


Data from the National Center for HIV, STD, and TB 
Prevention. 


Chapter Review 


The most basic type of association is a linear 
association. This type of relationship can be defined 
algebraically by the equations used, numerically 
with actual or predicted data values, or graphically 
from a plotted curve. (Lines are classified as straight 
curves.) Algebraically, a linear equation typically 
takes the form y = mx + b, where m and b are 
constants, x is the independent variable, and y is the 
dependent variable. In a statistical context, a linear 
equation is written in the form y = a + bx, where 
a and b are the constants. This form is used to help 
readers distinguish the statistical context from the 
algebraic context. In the equation y = a + bx, the 
constant b that multiplies the x variable (b is called 
a coefficient) is called the slope. The slope 
describes the rate of change between the 
independent and dependent variables; in other 
words, it describes the change that 
occurs in the dependent variable as the independent 
variable is changed. In the equation y = a + bx, the 
constant a is called the y-intercept. Graphically, 
the y-intercept is the y coordinate of the point 
where the graph of the line crosses the y axis. At 
this point x = 0. 


The slope of a line is a value that describes the rate 
of change between the independent and dependent 
variables. The slope tells us how the dependent 
variable (y) changes for every one unit increase in 
the independent (x) variable, on average. The 
y-intercept is used to describe the dependent variable 
when the independent variable equals zero. 
Graphically, the slope is represented by three line 
types in elementary statistics: a line that slopes 
upward to the right (b > 0), a horizontal line 
(b = 0), and a line that slopes downward to the 
right (b < 0). 


Formula Review 


y = a + bx, where a is the y-intercept and b is the 
slope. The variable x is the independent variable 
and y is the dependent variable. 


Use the following information to answer the next three 
exercises. A vacation resort rents SCUBA equipment 
to certified divers. The resort charges an up-front fee 
of $25 and another fee of $12.50 an hour. 


What are the dependent and independent 
variables? 


dependent variable: fee amount; independent 
variable: time 


Find the equation that expresses the total fee in 
terms of the number of hours the equipment is 
rented. 


Graph the equation from [link]. 


Use the following information to answer the next two 
exercises. A credit card company charges $10 when a 
payment is late, and $5 a day each day the payment 
remains unpaid. 


Find the equation that expresses the total fee in 
terms of the number of days the payment is 
late. 


Graph the equation from [link]. 


Is the equation y = 10 + 5x - 3x^2 linear? Why 
or why not? 


Which of the following equations are linear? 


a. y = 6x + 8 
b. y + 7 = 3x 
c. y - x = 8x^2 
d. 4y = 8 


y = 6x + 8, 4y = 8, and y + 7 = 3x are all 
linear equations. 
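
A quick numerical check can support this kind of answer. For equally spaced x values, a linear equation y = a + bx produces y values whose successive differences are all equal (each difference equals b). The sketch below solves each equation for y by hand and tests that property on a few points; it is a sanity check under that assumption, not a proof, and the name `looks_linear` is hypothetical.

```python
def looks_linear(f, xs=range(-3, 4), tol=1e-9):
    """True if f has constant first differences on equally spaced x values."""
    ys = [f(x) for x in xs]
    diffs = [later - earlier for earlier, later in zip(ys, ys[1:])]
    return all(abs(d - diffs[0]) <= tol for d in diffs)


# Each equation rewritten as y = f(x).
results = {
    "y = 6x + 8": looks_linear(lambda x: 6 * x + 8),
    "y + 7 = 3x": looks_linear(lambda x: 3 * x - 7),
    "y - x = 8x^2": looks_linear(lambda x: 8 * x ** 2 + x),
    "4y = 8": looks_linear(lambda x: 2),
    "y = 10 + 5x - 3x^2": looks_linear(lambda x: 10 + 5 * x - 3 * x ** 2),
}

for equation, is_linear in results.items():
    print(equation, is_linear)
```

The two equations containing an x^2 term fail the check, since their differences change from one step to the next.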


Does the graph show a linear equation? Why or 
why not? 

[Graph not reproduced.] 


[link] contains real data for the first two decades of 
AIDS reporting. 


Table columns: Year, # AIDS cases diagnosed, 
# AIDS deaths. [Table data not reproduced.] 
Adults and Adolescents only, United States 


Use the columns "year" and "# AIDS cases 
diagnosed." Why is “year” the independent 
variable and “# AIDS cases diagnosed” the 
dependent variable (instead of the reverse)? 


The number of AIDS cases depends on the year. 
Therefore, year becomes the independent 
variable and the number of AIDS cases is the 
dependent variable. 


Use the following information to answer the next two 
exercises. A specialty cleaning company charges an 
equipment fee and an hourly labor fee. A linear 
equation that expresses the total amount of the fee 
the company charges for each session is y = 50 + 
100x. 


What are the independent and dependent 
variables? 


What is the y-intercept and what is the slope? 
Interpret them using complete sentences. 


The y-intercept is 50 (a = 50). At the start of 
the cleaning, the company charges a one-time 
fee of $50 (this is when x = 0). The slope is 
100 (b = 100). For each session, the company 
charges $100 for each hour they clean. 


Use the following information to answer the next three 
questions. Due to erosion, a river shoreline is losing 
several thousand pounds of soil each year. A linear 
equation that expresses the total amount of soil lost 
per year is y = 12,000x. 


What are the independent and dependent 
variables? 


How many pounds of soil does the shoreline 
lose in a year? 


12,000 pounds of soil 


What is the y-intercept? Interpret its meaning. 


Use the following information to answer the next two 
exercises. The price of a single issue of stock can 
fluctuate throughout the day. A linear equation that 
represents the price of stock for Shipment Express is 
y = 15 - 1.5x, where x is the number of hours 
passed in an eight-hour day of trading. 


What are the slope and y-intercept? Interpret 
their meaning. 


The slope is -1.5 (b = -1.5). This means the 
stock is losing value at a rate of $1.50 per hour. 
The y-intercept is $15 (a = 15). This means the 
price of stock before the trading day was $15. 
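
The same interpretation can be checked by tabulating the price over the eight-hour trading day. This is a sketch only. The function name `stock_price` is an assumption made here for illustration.

```python
def stock_price(hours):
    """Price in dollars after x hours of trading: y = 15 - 1.5x."""
    return 15 - 1.5 * hours

# Hours 0 through 8 cover the eight-hour trading day described above.
prices = [stock_price(h) for h in range(9)]
for hour, price in enumerate(prices):
    print(hour, price)
```

Each step down the list drops by $1.50, which is the slope, and the first entry is the $15 starting price, which is the y-intercept.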


If you owned this stock, would you want a 
positive or negative slope? Why? 


Homework 


For each of the following situations, state the 
independent variable and the dependent 
variable. 


1. A study is done to determine if elderly 
drivers are involved in more motor vehicle 
fatalities than other drivers. The number of 
fatalities per 100,000 drivers is compared 
to the age of drivers. 

2. A study is done to determine if the weekly 
grocery bill changes based on the number 
of family members. 

3. Insurance companies base life insurance 
premiums partially on the age of the 
applicant. 

4. Utility bills vary according to power 
consumption. 


5. A study is done to determine if a higher 
education reduces the crime rate in a 
population. 


1. independent variable: age; dependent 
variable: fatalities 

2. independent variable: # of family 
members; dependent variable: grocery bill 

3. independent variable: age of applicant; 
dependent variable: insurance premium 

4. independent variable: power consumption; 
dependent variable: utility bill 

5. independent variable: higher education 
(years); dependent variable: crime rates 


Piece-rate systems are widely debated incentive 
payment plans. In a recent study of loan officer 
effectiveness, the following piece-rate system 
was examined: 


% of goal    Incentive 
< 80         n/a 
80           $4,000, with an additional $125 added per 
             percentage point from 81-99% 
100          $6,500, with an additional $125 added per 
             percentage point from 101-119% 
120          $9,500, with an additional $125 added per 
             percentage point starting at 121% 


If a loan officer makes 95% of his or her goal, 
write the linear function that applies based on 
the incentive plan table. In context, explain the 
y-intercept and slope. 
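
One possible reading of the incentive table is a piecewise rule: each tier has a base payment, and $125 is added for every percentage point above the tier's starting percentage. The sketch below encodes that reading. The function name `incentive` and the tier boundaries are assumptions made for illustration, not a statement of how the study computed payments.

```python
def incentive(percent_of_goal):
    """Incentive in dollars under one assumed reading of the table."""
    rate = 125  # dollars added per percentage point above the tier's base
    if percent_of_goal < 80:
        return None  # the table lists n/a below 80% of goal
    if percent_of_goal < 100:
        base_percent, base_pay = 80, 4000
    elif percent_of_goal < 120:
        base_percent, base_pay = 100, 6500
    else:
        base_percent, base_pay = 120, 9500
    return base_pay + rate * (percent_of_goal - base_percent)

print(incentive(95))
```

Under this reading, a loan officer between 80% and 99% of goal falls on the line y = 4,000 + 125(x - 80), where x is the percentage of goal reached.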


